Search icon
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Pentaho 3.2 Data Integration: Beginner's Guide
Pentaho 3.2 Data Integration: Beginner's Guide

Pentaho 3.2 Data Integration: Beginner's Guide: Explore, transform, validate, and integrate your data with ease

eBook
€32.99 €22.99
Print
€41.99
Subscription
€14.99 Monthly

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Black & white paperback book shipped to your address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Buy Now

Product Details


Publication date : Apr 4, 2010
Length 492 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781847199546
Vendor :
Pentaho
Category :
Table of content icon View table of contents Preview book icon Preview Book

Pentaho 3.2 Data Integration: Beginner's Guide

Chapter 1. Getting Started with Pentaho Data Integration

Pentaho Data Integration is an engine along with a suite of tools responsible for the processes of extracting, transforming, and loading—best known as the ETL processes. This book is meant to teach you how to use PDI.

In this chapter you will:

  • Learn what Pentaho Data Integration is

  • Install the software and start working with the PDI graphical designer

  • Install MySQL, a database engine that you will use when you start working with databases

Pentaho Data Integration and Pentaho BI Suite


Before introducing PDI, let's talk about Pentaho BI Suite. The Pentaho Business Intelligence Suite is a collection of software applications intended to create and deliver solutions for decision making. The main functional areas covered by the suite are:

  • Analysis: The analysis engine serves multidimensional analysis. It's provided by the Mondrian OLAP server and the JPivot library for navigation and exploring.

  • Reporting: The reporting engine allows designing, creating, and distributing reports in various known formats (HTML, PDF, and so on) from different kinds of sources. The reports created in Pentaho are based mainly in the JFreeReport library, but it's possible to integrate reports created with external reporting libraries such as Jasper Reports or BIRT.

  • Data Mining: Data mining is running data through algorithms in order to understand the business and do predictive analysis. Data mining is possible thanks to the Weka Project.

  • Dashboards: Dashboards are used to monitor and analyze Key Performance Indicators (KPIs). A set of tools incorporated to the BI Suite in the latest version allows users to create interesting dashboards, including graphs, reports, analysis views, and other Pentaho content, without much effort.

  • Data integration: Data integration is used to integrate scattered information from different sources (applications, databases, files) and make the integrated information available to the final user. Pentaho Data Integration—our main concern—is the engine that provides this functionality.

All this functionality can be used standalone as well as integrated. In order to run analysis, reports, and so on integrated as a suite, you have to use the Pentaho BI Platform. The platform has a solution engine, and offers critical services such as authentication, scheduling, security, and web services.

This set of software and services forms a complete BI Platform, which makes Pentaho Suite the world's leading open source Business Intelligence Suite.

Exploring the Pentaho Demo

Despite being out of the scope of this book, it's worth to briefly introduce the Pentaho Demo. The Pentaho BI Platform Demo is a preconfigured installation that lets you explore several capabilities of the Pentaho platform. It includes sample reports, cubes, and dashboards for Steel Wheels. Steel Wheels is a fictional store that sells all kind of scale replicas of vehicles.

The demo can be downloaded from http://sourceforge.net/projects/pentaho/files/. Under the Business Intelligence Server folder, look for the latest stable version. The file you have to download is named biserver-ce-3.5.2.stable.zip for Windows and biserver-ce-3.5.2.stable.tar.gz for other systems.

In the same folder you will find a file named biserver-getting_started-ce-3.5.0.pdf. The file is a guide that introduces you the platform and gives you some guidance on how to install and run it. The guide even includes a mini tutorial on building a simple PDI input-output transformation.

Note

You can find more about Pentaho BI Suite at www.pentaho.org.

Pentaho Data Integration


Most of the Pentaho engines, including the engines mentioned earlier, were created as community projects and later adopted by Pentaho. The PDI engine is no exception—Pentaho Data Integration is the new denomination for the business intelligence tool born as Kettle.

Note

The name Kettle didn't come from the recursive acronym Kettle Extraction, Transportation, Transformation, and Loading Environment it has now, but from KDE Extraction, Transportation, Transformation and Loading Environment, as the tool was planned to be written on top of KDE, as mentioned in the introduction of the book.

In April 2006 the Kettle project was acquired by the Pentaho Corporation and Matt Casters, Kettle's founder, also joined the Pentaho team as a Data Integration Architect.

When Pentaho announced the acquisition, James Dixon, the Chief Technology Officer, said:

We reviewed many alternatives for open source data integration, and Kettle clearly had the best architecture, richest functionality, and most mature user interface. The open architecture and superior technology of the Pentaho BI Platform and Kettle allowed us to deliver integration in only a few days, and make that integration available to the community.

By joining forces with Pentaho, Kettle benefited from a huge developer community, as well as from a company that would support the future of the project.

From that moment the tool has grown constantly. Every few months a new release is available, bringing to the users, improvements in performance and existing functionality, new functionality, ease of use, and great changes in look and feel. The following is a timeline of the major events related to PDI since its acquisition by Pentaho:

  • June 2006: PDI 2.3 is released. Numerous developers had joined the project and there were bug fixes provided by people in various regions of the world. Among other changes, the version included enhancements for large scale environments and multilingual capabilities.

  • February 2007: Almost seven months after the last major revision, PDI 2.4 is released including remote execution and clustering support (more on this in Chapter 13), enhanced database support, and a single designer for the two main elements you design in Kettle—jobs and transformations.

  • May 2007: PDI 2.5 is released including many new features, the main feature being the advanced error handling.

  • November 2007: PDI 3.0 emerges totally redesigned. Its major library changed to gain massive performance. The look and feel also changed completely.

  • October 2008: PDI 3.1 comes with an easier-to-use tool, along with a lot of new functionalities as well.

  • April 2009: PDI 3.2 is released with a really large number of changes for a minor version—new functionality, visualization improvements, performance improvements, and a huge pile of bug fixes. The main change in this version was the incorporation of dynamic clustering (see Chapter 13 for details).

  • In 2010 PDI 4.0 will be released, delivering mostly improvements with regard to enterprise features such as version control.

Note

Most users still refer to PDI as Kettle, its further name. Therefore, the names PDI, Pentaho Data Integration, and Kettle will be used interchangeably throughout the book.

Using PDI in real world scenarios

Paying attention to its name, Pentaho Data Integration, you could think of PDI as a tool to integrate data.

In you look at its original name, K.E.T.T.L.E., then you must conclude that it is a tool used for ETL processes which, as you may know, are most frequently seen in data warehouse environments.

In fact, PDI not only serves as a data integrator or an ETL tool, but is such a powerful tool that it is common to see it used for those and for many other purposes. Here you have some examples.

Loading datawarehouses or datamarts

The loading of a datawarehouse or a datamart involves many steps, and there are many variants depending on business area or business rules. However, in every case, the process involves the following steps:

  • Extracting information from one or different databases, text files, and other sources. The extraction process may include the task of validating and discarding data that doesn't match expected patterns or rules.

  • Transforming the obtained data to meet the business and technical needs required on the target. Transformation implies tasks such as converting data types, doing some calculations, filtering irrelevant data, and summarizing.

  • Loading the transformed data into the target database. Depending on the requirements, the loading may overwrite the existing information, or may add new information each time it is executed.

Kettle comes ready to do every stage of this loading process. The following sample screenshot shows a simple ETL designed with Kettle:

Integrating data

Imagine two similar companies that need to merge their databases in order to have a unified view of the data, or a single company that has to combine information from a main ERP application and a CRM application, though they're not connected. These are just two of hundreds of examples where data integration is needed. Integrating data is not just a matter of gathering and mixing data; some conversions, validation, and transport of data has to be done. Kettle is meant to do all those tasks.

Data cleansing

Why do we need that data be correct and accurate? There are many reasons—for the efficiency of business, to generate trusted conclusions in data mining or statistical studies, to succeed when integrating data, and so on. Data cleansing is about ensuring that the data is correct and precise. This can be ensured by verifying if the data meets certain rules, discarding or correcting those that don't follow the expected pattern, setting default values for missing data, eliminating information that is duplicated, normalizing data to conform minimum and maximum values, and so on—tasks that Kettle makes possible, thanks to its vast set of transformation and validation capabilities.

Migrating information

Think of a company of any size that uses a commercial ERP application. One day the owners realize that the licences are consuming an important share of its budget and so they decide to migrate to an open source ERP. The company will no longer have to pay licences, but if they want to do the change, they will have to migrate the information. Obviously it is not an option to start from scratch, or type the information by hand. Kettle makes the migration possible, thanks to its ability to interact with most kinds of sources and destinations such as plain files, and commercial and free databases and spreadsheets.

Exporting data

Sometimes you are forced by government regulations to export certain data to be processed by legacy systems. You can't just print and deliver some reports containing the required data. The data has to have a rigid format, with columns that have to obey some rules (size, format, content), different records for heading and tail, just to name some common demands. Kettle has the power to take crude data from the source and generate these kinds of ad hoc reports.

Integrating PDI using Pentaho BI

The previous examples show typical uses of PDI as a standalone application. However, Kettle may be used as part of a process inside the Pentaho BI Platform. There are many things embedded in the Pentaho application that Kettle can do—preprocessing data for an on-line report, sending mails in a schedule fashion, or generating spreadsheet reports.

Note

You'll find more on this in Chapter 13. However, the use of PDI integrated with the BI Suite is beyond the scope of this book.

Pop quiz – PDI data sources

Which of the following aren't valid sources in Kettle:

  1. Spreadsheets

  2. Free database engines

  3. Commercial database engines

  4. Flat files

  5. None of the above

Installing PDI


In order to work with PDI you need to install the software. It's a simple task; let's do it.

Time for action – installing PDI


These are the instructions to install Kettle, whatever your operating system.

The only prerequisite to install PDI is to have JRE 5.0 or higher installed. If you don't have it, please download it from http://www.javasoft.com/ and install it before proceeding. Once you have checked the prerequisite, follow these steps:

  1. From http://community.pentaho.com/sourceforge/ follow the link to Pentaho Data Integration (Kettle). Alternatively, go directly to the download page http://sourceforge.net/projects/pentaho/files/Data Integration.

  2. Choose the newest stable release. At this time, it is 3.2.0.

  3. Download the file that matches your platform. The preceding screenshot should help you.

  4. Unzip the downloaded file in a folder of your choice—C:/Kettle or /home/your_dir/kettle.

  5. If your system is Windows, you're done. Under UNIX-like environments, it's recommended that you make the scripts executable. Assuming that you chose Kettle as the installation folder, execute the following command:

    cd Kettle
    chmod +x *.sh
    

What just happened?

You have installed the tool in just a few minutes. Now you have all you need to start working.

Pop quiz – PDI prerequisites

Which of the following are mandatory to run PDI? You may choose more than one option.

  1. Kettle

  2. Pentaho BI platform

  3. JRE

  4. A database engine

Launching the PDI graphical designer: Spoon


Now that you've installed PDI, you must be eager to do some stuff with data. That will be possible only inside a graphical environment. PDI has a desktop designer tool named Spoon. Let's see how it feels to work with it.

Time for action – starting and customizing Spoon


In this tutorial you're going to launch the PDI graphical designer and get familiarized with its main features.

  1. Start Spoon.

    • If your system is Windows, type the following command:

      Spoon.bat
      
    • In other platforms such as Unix, Linux, and so on, type:

      Spoon.sh
      
    • If you didn't make spoon.sh executable, you may type:

      sh Spoon.sh
      
  2. As soon as Spoon starts, a dialog window appears asking for the repository connection data. Click the No Repository button. The main window appears. You will see a small window with the tip of the day. After reading it, close that window.

  3. A welcome! window appears with some useful links for you to see.

  4. Close the welcome window. You can open that window later from the main menu.

  5. Click Options... from the Edit menu. A window appears where you can change various general and visual characteristics. Uncheck the circled checkboxes:

  6. Select the tab window Look Feel.

  7. Change the Grid size and Preferred Language settings as follows:

  8. Click the OK button.

  9. Restart Spoon in order to apply the changes. You should neither see the repository dialog, nor the welcome window. You should see the following screen instead:

What just happened?

You ran for the first time the graphical designer of PDI Spoon, and applied some custom configuration.

From the Look Feel configuration window, you changed the size of the dotted grid that appears in the canvas area while you are working. You also changed the preferred language. In the Option tab window, you chose not to show either the repository dialog or the welcome window at startup. These changes were applied as you restarted the tool, not before.

The second time you launched the tool, the repository dialog didn't show up. When the main window appeared, all the visible texts were shown in French, which was the selected language, and instead of the welcome window, there was a blank screen.

Spoon

This tool that you're exploring in this section is the PDI's desktop design tool. With Spoon you design, preview, and test all your work, that is, transformations and jobs. When you see PDI screenshots, what you are really seeing are Spoon screenshots. The other PDI components that you will meet in the following chapters are executed from terminal windows.

Setting preferences in the Options window

In the tutorial you changed some preferences in the Options window. There are several look and feel characteristics you can change beyond those you changed. Feel free to experiment with this setting.

Note

Remember to restart Spoon in order to see the changes applied.

Tip

If you choose any language as preferred language other than English, you should select a different language as alternative. If you do so, every name or description not translated to your preferred language will be shown in the alternative language.

Just for the curious people: Italian and French are the overall winners of the list of languages to which the tool has been translated from English. Below them follow Korean, Argentinean Spanish, Japanese, and Chinese.

One of the settings you changed was the appearance of the welcome window at start up. The welcome window has many useful links, all related with the tool: wiki pages, news, forum access, and more. It's worth exploring them.

Tip

You don't have to change the settings again to see the welcome window. You can open it from the menu Help | Show the Welcome Screen.

Storing transformations and jobs in a repository

The first time you launched Spoon, you chose No Repository. After that, you configured Spoon to stop asking you for the Repository option. You must be curious about what the repository is and why not to use it. Let's explain it.

As said, the results of working with PDI are Transformations and Jobs. In order to save the Transformations and Jobs, PDI offers two methods:

  • Repository: When you use the repository method you save jobs and transformations in a repository. A repository is a relational database specially designed for this purpose.

  • Files: The files method consists of saving jobs and transformations as regular XML files in the filesystem, with extension kjb and ktr respectively.

The following diagram summarizes this:

You cannot mix the two methods (files and repository) in the same project. Therefore, you must choose the method when you start the tool.

Why did we choose not to work with repository, or in other words, to work with files? This is mainly for the following two reasons:

  • Working with files is more natural and practical for most users.

  • Working with repository requires minimum database knowledge and that you also have access to a database engine from your computer. Having both preconditions would allow you to learn working with both methods. However, it's probable that you haven't.

Throughout this book, we will use the file method. For details of working with repositories, please refer to Appendix A.

Creating your first transformation

Until now, you've seen the very basic elements of Spoon. For sure, you must be waiting to do some interesting task beyond looking around. It's time to create your first transformation.

Time for action – creating a hello world transformation


How about starting by saying Hello to the World? Not original but enough for a very first practical exercise. Here is how you do it:

  1. Create a folder named pdi_labs under the folder of your choice.

  2. Open Spoon.

  3. From the main menu select File | New Transformation.

  4. At the left-hand side of the screen, you'll see a tree of Steps. Expand the Input branch by double-clicking it.

  5. Left-click the Generate Rows icon.

  6. Without releasing the button, drag-and-drop the selected icon to the main canvas. The screen will look like this:

  7. Double-click the Generate Rows step that you just put in the canvas and fill the text boxes and grid as follows:

  8. From the Steps tree, double-click the Flow step.

  9. Click the Dummy icon and drag-and-drop it to the main canvas.

  10. Click the Generate Rows step and holding the Shift key down, drag the cursor towards the Dummy step. Release the button. The screen should look like this:

  11. Right-click somewhere on the canvas to bring up a contextual menu.

  12. Select New note. A note editor appears.

  13. Type some description such as Hello World! and click OK.

  14. From the main menu, select Transformation | Configuration. A window appears to specify transformation properties. Fill the Transformation name with a simple name as hello_world. Fill the Description field with a short description such as My first transformation. Finally provide a more clear explanation in the Extended description text box and click OK.

  15. From the main menu, select File | Save.

  16. Save the transformation in the folder pdi_labs with the name hello_world.

  17. Select the Dummy step by left-clicking it.

  18. Click on the Preview button in the menu above the main canvas.

  19. A debug window appears. Click the Quick Launch button.

  20. The following window appears to preview the data generated by the transformation:

  21. Close the preview window and click the Run button.

  22. A window appears. Click Launch.

  23. The execution results are shown in the bottom of the screen. The Logging tab should look as follows:

What just happened?

You've just created your first transformation.

First, you created a new transformation. From the tree on the left, you dragged two steps and drop them into the canvas. Finally, you linked them with a hop.

With the Generate Rows step, you created 10 rows of data with the message Hello World!. The Dummy step simply served as a destination of those rows.

After creating the transformation, you did a preview. The preview allowed you to see the content of the created data, this is, the 10 rows with the message Hello World!

Finally, you ran the transformation. You could see the results of the execution at the bottom of the windows. There is a tab named Step Metrics with information about what happens with each steps in the transformation. There is also a Logging tab showing a complete detail of what happened.

Directing the Kettle engine with transformations

As shown in the following diagram, transformation is an entity made of steps linked by hops. These steps and hops build paths through which data flows. The data enters or is created in a step, the step applies some kind of transformation to it, and finally the data leaves that step. Therefore, it's said that a transformation is data-flow oriented.

A transformation itself is not a program nor an executable file. It is just plain XML. The transformation contains metadata that tells the Kettle engine what to do.

A step is the minimal unit inside a transformation. A big set of steps is available. These steps are grouped in categories such as the input and flow categories that you saw in the example. Each step is conceived to accomplish a specific function, going from reading a parameter to normalizing a dataset. Each step has a configuration window. These windows vary according to the functionality of the steps and the category to which they belong. What all steps have in common are the name and description:

Step property

Description

Name

A representative name inside the transformation.

Description

A brief explanation that allows you to clarify the purpose of the step. It's not mandatory but it is useful.

A hop is a graphical representation of data flowing between two steps—an origin and a destination. The data that flows through that hop constitutes the output data of the origin step and the input data of the destination step.

Exploring the Spoon interface

As you just saw, the Spoon is the tool using which you create, preview, and run transformations. The following screenshot shows you the basic work areas:

Note

The words canvas and work area will be used interchangeably throughout the book.

Viewing the transformation structure

If you click the View icon in the upper left corner of the screen, the tree will change to show the structure of the transformation currently being edited.

Running and previewing the transformation

The Preview functionality allows you to see a sample of the data produced for selected steps. In the previous example, you previewed the output of the Dummy Step. The Run option effectively runs the whole transformation.

Whether you preview or run a transformation, you'll get an execution results window showing what happened. Let's explain it through an example.

Time for action – running and previewing the hello_world transformation


Let's do some testing and explore the results:

  1. Open the hello_world transformation.

  2. Edit the Generate Rows step, and change the limit from 10 to 1000 so that it generates 1,000 rows.

  3. Select the Logging tab window at the bottom of the screen.

  4. Click on Run.

  5. In the Log level drop-down list, select RowLevel detail.

  6. Click on Launch.

  7. You can see how the logging window shows every task in a very detailed way.

  8. Edit the Generate Rows step, and change the limit to 10,000 so that it generates 10,000 rows.

  9. Select the Step Metrics.

  10. Run the transformation.

  11. You can see how the numbers change as the rows travel through the steps.

What just happened?

You did some tests with the hello_world transformation and saw the results in the Execution Results window.

Previewing the results in the Execution Results window

The Execution Results window shows you what is happening while you preview or run a transformation.

The Logging tab shows the execution of your transformation, step by step. By default, the level of the logging detail is Basic but you can change it to see different levels of detail—from a minimal logging (level Minimal) to a very detailed one (level RowLevel).

The Step Metrics tab shows, for each step of the transformation, the executed operations and several status and information columns. You may be interested in the following columns:

Column

Description

Read

Contains the number of rows coming from previous steps

Written

Contains the number of rows leaving from this step toward the next

Input

Number of rows read from a file or table

Output

Number of rows written to a file or table

Errors

Errors in the execution. If there are errors, the whole row becomes red

Active

Tells the current status of the execution

In the example, you can see that the Generate Rows step writes rows, which then are read by the Dummy step. The Dummy step also writes the same rows, but in this case those go nowhere.

Pop quiz – PDI basics

For each of the following, decide if the sentence is true or false:

  1. There are several graphical tools in PDI, but Spoon is the most used.

  2. You can choose to save Transformations either in files or in a database.

  3. To run a Transformation, an executable file has to be generated from Spoon.

  4. The grid size option in the Look and Feel windows allows you to resize the work area.

  5. To create a transformation, you have to provide external data.

Installing MySQL


Before skipping to the next chapter, let's devote some minutes to the installation of MySQL.

In Chapter 8 you will begin working with databases from PDI. In order to do that, you will need access to some database engine. As MySQL is the world's most popular open source database, it was the database engine chosen for the database-related tutorials in the book.

In this section you will learn to install the MySQL database engine both in Windows and Ubuntu, the most popular distribution of Linux these days. As the procedures for installing the software are different, a separate explanation is given for each system.

Time for action – installing MySQL on Windows


In order to install MySQL on your Windows system, please follow these instructions:

  1. Open an internet browser and type http://dev.mysql.com/downloads/mysql/.

  2. Select the Microsoft Windows platform and download the mysql-essential package that matches your system: 32-bit or 64-bit.

  3. Double-click the downloaded file. A wizard will guide you through the process.

  4. When asked about the setup type, select Typical.

  5. Several screens follow. When the wizard is complete you'll have the option to configure the server. Check Configure the MySQL Server now and click Finish.

  6. A new wizard will be launched that lets you configure the server.

  7. When asked about the configuration type, select Standard Configuration.

  8. When prompted, set the Windows options as shown in the next screenshot:

  9. When prompted for the security options, provide a password for the root user.You'll have to retype the password.

    Tip

    Provide a password that you can remember. You'll need it later to connect to the MySQL server.

  10. In the next window click on Execute to proceed with the configuration. When the configuration is done, you'll see this:

  11. Click on Finish. After installing MySQL it is recommended that you install the GUI tools for administering and querying the database.

  12. Open an Internet browser and type http://dev.mysql.com/downloads/gui-tools/.

  13. Look for the Windows downloads and download the Windows (x86) package.

  14. Double-click the downloaded file. A wizard will guide you through the process.

  15. When asked about the setup type, select Complete.

  16. Several screens follow. Just follow the wizard instructions.

  17. When the wizard ends, you'll have the GUI tools added to the MySQL menu.

What just happened?

You downloaded and installed MySQL on your Windows system. You also installed MySQL GUI tools, a software package that includes an administrator and a query browser utility and that will make your life easier when working with the database.

Time for action – installing MySQL on Ubuntu


This tutorial shows you the procedure to install MySQL on Ubuntu.

Note

In order to follow the tutorial you need to be connected to the Internet.

Please follow these instructions:

  1. Check that you have access to the Internet.

  2. Open the Synaptic package manager from System | Administration | Synaptic Package Manager.

  3. Under Quick search type mysql-server and click on the Search button.

  4. Among the results, locate mysql-server-5.1, click in the tiny square to the left, and select Mark for Installation.

  5. You'll be prompted for confirmation. Click on Mark.

  6. Now search for a package named mysql-admin.

  7. When found, mark it for installation in the same way.

  8. Click on Apply on the main toolbar.

  9. A window shows up asking for confirmation. Click on Mark again. What follows is the download process followed by the installation process.

  10. At a particular moment a window appears asking you for a password for the root user—the administrator of the database. Enter a password of your choice. You'll have to enter it twice.

    Tip

    Think of a password that you can remember. You'll need it later to connect to the MySQL server.

  11. When the process ends, you will see the changes applied.

  12. Under Applications a new menu will also be added to access the GUI tools.

What just happened?

You installed MySQL server and GUI Tools in your Ubuntu system.

Tip

The previous directions are for standard installations. For custom installations, instructions related to other operating systems, or for troubleshooting, please check the MySQL documentation at—http://dev.mysql.com/doc/refman/5.1/en/installing.html.

Summary


In this first chapter, you were introduced to Pentaho Data Integration. Specifically, you learned what Pentaho Data Integration is and you installed the tool. You were also introduced to Spoon, the graphical designer of PDI, and you created your first transformation.

As an additional exercise, you installed a MySQL server and the MySQL GUI tools. You will need this software when you start working with databases in Chapter 8.

Now that you've learned the basics, you're ready to begin creating your own transformations to explore real data. That is the topic of the next chapter.

Left arrow icon Right arrow icon

Key benefits

  • Get started with Pentaho Data Integration from scratch.
  • Enrich your data transformation operations by embedding Java and JavaScript code in PDI transformations.
  • Create a simple but complete Datamart Project that will cover all key features of PDI.
  • Part of Packt's Beginner's Guide series: Each task gradually develops from the previous task with an organized sequence of instructions accompanied with relevant explanation and a set of challenging tasks that the reader will be able to accomplish.

Description

Pentaho Data Integration (a.k.a. Kettle) is a full-featured open source ETL (Extract, Transform, and Load) solution. Although PDI is a feature-rich tool, effectively capturing, manipulating, cleansing, transferring, and loading data can get complicated.This book is full of practical examples that will help you to take advantage of Pentaho Data Integration's graphical, drag-and-drop design environment. You will quickly get started with Pentaho Data Integration by following the step-by-step guidance in this book. The useful tips in this book will encourage you to exploit powerful features of Pentaho Data Integration and perform ETL operations with ease.Starting with the installation of the PDI software, this book will teach you all the key PDI concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to work with plain files, and to do all kinds of data manipulation. Then, the book gives you a primer on databases and teaches you how to work with databases inside PDI. Not only that, you'll be given an introduction to data warehouse concepts and you will learn to load data in a data warehouse. After that, you will learn to implement simple and complex processes.Once you've learned all the basics, you will build a simple datamart that will serve to reinforce all the concepts learned through the book.

What you will learn

Install Penataho Data Integration and get familiar with the graphical designer—Spoon Work with files to get data from simple input sources, preview, and send it back in any of the common output formats Perform basic transformation operations on data such as performing calculations, adding constants, counting, filtering, ordering, and searching data Solve real-world problems by manipulating the flow of data by combining or splitting it Solve sophisticated problems such as normalizing data from pivoted tables with ease Explore the various PDI options to validate data and to handle errors Connect to a database engine of your choice to get and store information coming from any source Perform advanced operations with databases such as loading data dimensions Create advanced processes such as nesting jobs, iterating on jobs and transformations, and creating subtransformations Implement business processes by scheduling tasks, checking conditions, organizing files and folders, sending emails with alerts, and treating errors in a way that meets your requirements

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Black & white paperback book shipped to your address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Buy Now

Product Details


Publication date : Apr 4, 2010
Length 492 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781847199546
Vendor :
Pentaho
Category :

Table of Contents

27 Chapters
Pentaho 3.2 Data Integration Beginner's Guide Chevron down icon Chevron up icon
Credits Chevron down icon Chevron up icon
Foreword Chevron down icon Chevron up icon
The Kettle Project Chevron down icon Chevron up icon
About the Author Chevron down icon Chevron up icon
About the Reviewers Chevron down icon Chevron up icon
Preface Chevron down icon Chevron up icon
Getting Started with Pentaho Data Integration Chevron down icon Chevron up icon
Getting Started with Transformations Chevron down icon Chevron up icon
Basic Data Manipulation Chevron down icon Chevron up icon
Controlling the Flow of Data Chevron down icon Chevron up icon
Transforming Your Data with JavaScript Code and the JavaScript Step Chevron down icon Chevron up icon
Transforming the Row Set Chevron down icon Chevron up icon
Validating Data and Handling Errors Chevron down icon Chevron up icon
Working with Databases Chevron down icon Chevron up icon
Performing Advanced Operations with Databases Chevron down icon Chevron up icon
Creating Basic Task Flows Chevron down icon Chevron up icon
Creating Advanced Transformations and Jobs Chevron down icon Chevron up icon
Developing and Implementing a Simple Datamart Chevron down icon Chevron up icon
Taking it Further Chevron down icon Chevron up icon
Working with Repositories Chevron down icon Chevron up icon
Pan and Kitchen: Launching Transformations and Jobs from the Command Line Chevron down icon Chevron up icon
Quick Reference: Steps and Job Entries Chevron down icon Chevron up icon
Spoon Shortcuts Chevron down icon Chevron up icon
Introducing PDI 4 Features Chevron down icon Chevron up icon
Pop Quiz Answers Chevron down icon Chevron up icon
Index Chevron down icon Chevron up icon

Customer reviews

Filter icon Filter
Top Reviews
Rating distribution
Empty star icon Empty star icon Empty star icon Empty star icon Empty star icon 0
(0 Ratings)
5 star 0%
4 star 0%
3 star 0%
2 star 0%
1 star 0%

Filter reviews by


No reviews found
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is the delivery time and cost of print book? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is custom duty/charge? Chevron down icon Chevron up icon

Customs duty are charges levied on goods when they cross international borders. It is a tax that is imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order? Chevron down icon Chevron up icon

The orders shipped to the countries that are listed under EU27 will not bear custom charges. They are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea:

A custom duty or localized taxes may be applicable on the shipment and would be charged by the recipient country outside of the EU27 which should be paid by the customer and these duties are not included in the shipping charges been charged on the order.

How do I know my custom duty charges? Chevron down icon Chevron up icon

The amount of duty payable varies greatly depending on the imported goods, the country of origin and several other factors like the total invoice amount or dimensions like weight, and other such criteria applicable in your country.

For example:

  • If you live in Mexico, and the declared value of your ordered items is over $ 50, for you to receive a package, you will have to pay additional import tax of 19% which will be $ 9.50 to the courier service.
  • Whereas if you live in Turkey, and the declared value of your ordered items is over € 22, for you to receive a package, you will have to pay additional import tax of 18% which will be € 3.96 to the courier service.
How can I cancel my order? Chevron down icon Chevron up icon

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing the order. Simply contact customercare@packt.com with your order details or payment transaction id. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on the way to you then when you receive it, you can contact us at customercare@packt.com using the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except for the cases described in our Return Policy (i.e. Packt Publishing agrees to replace your printed book because it arrives damaged or material defect in book), Packt Publishing will not accept returns.

What is your returns and refunds policy? Chevron down icon Chevron up icon

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work or is unacceptably late, please contact Customer Relations Team on customercare@packt.com with the order number and issue details as explained below:

  1. If you ordered (eBook, Video or Print Book) incorrectly or accidentally, please contact Customer Relations Team on customercare@packt.com within one hour of placing the order and we will replace/refund you the item cost.
  2. Sadly, if your eBook or Video file is faulty or a fault occurs during the eBook or Video being made available to you, i.e. during download then you should contact Customer Relations Team within 14 days of purchase on customercare@packt.com who will be able to resolve this issue for you.
  3. You will have a choice of replacement or refund of the problem items.(damaged, defective or incorrect)
  4. Once Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged, with book material defect, contact our Customer Relation Team on customercare@packt.com within 14 days of receipt of the book with appropriate evidence of damage and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner which is on a print-on-demand basis.

What tax is charged? Chevron down icon Chevron up icon

Currently, no tax is charged on the purchase of any print book (subject to change based on the laws and regulations). A localized VAT fee is charged only to our European and UK customers on eBooks, Video and subscriptions that they buy. GST is charged to Indian customers for eBooks and video purchases.

What payment methods can I use? Chevron down icon Chevron up icon

You can pay with the following card types:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal
What is the delivery time and cost of print books? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela