Reader small image

You're reading from  Pentaho Data Integration Quick Start Guide

Product typeBook
Published inAug 2018
Reading LevelIntermediate
PublisherPackt
ISBN-139781789343328
Edition1st Edition
Languages
Tools
Right arrow
Author (1)
María Carina Roldán
María Carina Roldán
author image
María Carina Roldán

María Carina Roldán was born in Argentina and has a bachelor's degree in computer science. She started working with Pentaho back in 2006. She spent all these years developing BI solutions, mainly as an ETL specialist, and working for different companies around the world. Currently, she lives in Buenos Aires and works as an independent consultant. Carina is the author of Learning Pentaho Data Integration 8 CE, published by Packt in December 2017. She has also authored other books on Pentaho, all of them published by Packt.
Read more about María Carina Roldán

Right arrow

Chapter 3. Extracting Data

Extracting data is all about getting and combining data from different sources, before transforming it in different ways. PDI offers connectivity to a big list of data sources, including all kinds of databases, both commercial and open source. It can also connect to a wide variety of files, both structured and unstructured. The list includes CSV files, properties files, fixed-width text files, and proprietary formats. In particular, this chapter will explain how to get data from plain files and relational databases.

The following topics will be covered in this chapter:

  • Getting data from plain files
  • Getting data from relational databases
  • Getting data from other sources
  • Combining different sources into a single dataset

Getting data from plain files


In this section, you will learn how to get data from plain files (for example, .txt and CSV files). We will start by explaining how to read and configure such files, and then we will explain how PDI allows you to read multiple files at once, compressed files, and files stored in remote locations.

Reading plain files

In the previous chapter, we experimented with reading a simple file, but this time we will go into detail on getting and properly configuring a simple file's metadata.

Note

For this and some of the future exercises in this book, we will use .csv files with surveys of the Airbnb website. The sample data can be downloaded from http://tomslee.net/airbnb-data-collection-get-the-data.

For this exercise, we will read and configure a file with data about a survey carried out in Amsterdam. The file looks as follows:

Sample file

This time, we will use a Text file input step, which is much more flexible than the CSV file input that you are familiar with:

  1. Create a...

Getting data from relational databases


Relational databases are some of the most common sources of data in any ETL process. PDI enables you to connect and get data from multiple RDBMS engines, including the most popular (for example, Oracle or MySQL) but also other engines. The only prerequisite is that there exists the proper JDBC driver. In this section, you will learn how to connect to, explore, and get data from a database.

Connecting to a database and using the database explorer

There are two things that you must do in order to connect to a database, if you intend to use its data inside PDI:

  • Install the proper JDBC driver
  • Create a connection to the database

Note

For demonstration purposes, we will connect to a PostgreSQL engine where we have installed a sports database, available for download at http://www.sportsdb.org/sd that you have a JDBC drive/samples.

Make sure that you have a JDBC driver, a .jar file – for the engine. Once you have it, you will have to copy it into the lib folder in...

Getting data from other sources


So far, we have been getting data from plain files and databases. These are two of the most common data sources, but there are many more kinds of sources available in PDI, mainly grouped in, but not limited to, the Input folder. The following subsections will present some of the sources that we didn't cover in the previous sections, which are also of use.

XML and JSON

With PDI, you can read XML files or parse fields whose contents are in an XML structure. In both cases, you parse the XML with the Get data from XML input step. For specifying the fields to read, you use XPath notation. When the XML is very big or complex, there is an alternative step:XML Input Stream (StAX).

Similarly, you can parse JSON structures with the JSON Input step. For specifying the fields in this case, you use JSONPath notation.

Also, you can parse both XML and JSON structures with JavaScript or Java code, by using the Modified Java Script Value step or the User Defined Java Class step...

Combining different sources into a single dataset


In this chapter, you have been getting data from different kinds of sources. In all cases, the source of the data was unique; for example, a plain file or the result of a single query. However, what if we had more than one source, and we wanted to combine them and use them as a single dataset? In this section, you will learn how to deal with this very common situation.

Manipulating the metadata

Let's look at the first exercise again, where we read a file containing surveys. On that occasion, we read all of the information in the file. Now, suppose that we are interested in just a few fields: room_idroom_typeneighborhoodoverall_satisfaction, accommodates, and price. Also, we want to rename some fields, and we want them in a different order.

There is a very easy way to do this, as follows:

  1. Open the transformation created in the first exercise and save it under a different name. You can do so from Main Menu or Main Toolbar.
  2. From the Transform...

Summary


In this chapter, you learned how to get data from different sources, converting it to PDI datasets.

First, you learned how to get data from plain files stored in your local system. You also had the opportunity to configure PDI to access compressed files and files stored in a Google Drive.

Having worked with files, you started to interact with relational databases. You learned how to configure a connection to a database, how to explore its content, and how to get data from it.

Finally, you were presented with sources other than plain files and databases, including XML and JSON sources and sources of system-related information.

Having explored the different options for getting external information, you learned how to combine two or more datasets into a single one. This task will be used not only for extracting and combining external sources but in many situations in your daily PDI work.

Now that you have the data, you are ready to transform it. You will learn how to do so in the next chapter...

lock icon
The rest of the chapter is locked
You have been reading a chapter from
Pentaho Data Integration Quick Start Guide
Published in: Aug 2018Publisher: PacktISBN-13: 9781789343328
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Author (1)

author image
María Carina Roldán

María Carina Roldán was born in Argentina and has a bachelor's degree in computer science. She started working with Pentaho back in 2006. She spent all these years developing BI solutions, mainly as an ETL specialist, and working for different companies around the world. Currently, she lives in Buenos Aires and works as an independent consultant. Carina is the author of Learning Pentaho Data Integration 8 CE, published by Packt in December 2017. She has also authored other books on Pentaho, all of them published by Packt.
Read more about María Carina Roldán