Packt+ | Advance your knowledge in tech

You're reading from Pentaho Data Integration Quick Start Guide

Product typeBook

Published inAug 2018

Reading LevelIntermediate

PublisherPackt

ISBN-139781789343328

Edition1st Edition

Languages

Java

Tools

Pentaho

Concepts

Business Intelligence

Author (1)

María Carina Roldán

Chapter 3. Extracting Data

Extracting data is all about getting and combining data from different sources, before transforming it in different ways. PDI offers connectivity to a big list of data sources, including all kinds of databases, both commercial and open source. It can also connect to a wide variety of files, both structured and unstructured. The list includes CSV files, properties files, fixed-width text files, and proprietary formats. In particular, this chapter will explain how to get data from plain files and relational databases.

The following topics will be covered in this chapter:

Getting data from plain files
Getting data from relational databases
Getting data from other sources
Combining different sources into a single dataset

Getting data from plain files

In this section, you will learn how to get data from plain files (for example, .txt and CSV files). We will start by explaining how to read and configure such files, and then we will explain how PDI allows you to read multiple files at once, compressed files, and files stored in remote locations.

Reading plain files

In the previous chapter, we experimented with reading a simple file, but this time we will go into detail on getting and properly configuring a simple file's metadata.

Note

For this and some of the future exercises in this book, we will use .csv files with surveys of the Airbnb website. The sample data can be downloaded from http://tomslee.net/airbnb-data-collection-get-the-data.

For this exercise, we will read and configure a file with data about a survey carried out in Amsterdam. The file looks as follows:

Sample file

This time, we will use a Text file input step, which is much more flexible than the CSV file input that you are familiar with:

Create a...

Getting data from relational databases

Relational databases are some of the most common sources of data in any ETL process. PDI enables you to connect and get data from multiple RDBMS engines, including the most popular (for example, Oracle or MySQL) but also other engines. The only prerequisite is that there exists the proper JDBC driver. In this section, you will learn how to connect to, explore, and get data from a database.

Connecting to a database and using the database explorer

There are two things that you must do in order to connect to a database, if you intend to use its data inside PDI:

Install the proper JDBC driver
Create a connection to the database

Note

For demonstration purposes, we will connect to a PostgreSQL engine where we have installed a sports database, available for download at http://www.sportsdb.org/sd that you have a JDBC drive/samples.

Make sure that you have a JDBC driver, a .jar file – for the engine. Once you have it, you will have to copy it into the lib folder in...

Getting data from other sources

So far, we have been getting data from plain files and databases. These are two of the most common data sources, but there are many more kinds of sources available in PDI, mainly grouped in, but not limited to, the Input folder. The following subsections will present some of the sources that we didn't cover in the previous sections, which are also of use.

XML and JSON

With PDI, you can read XML files or parse fields whose contents are in an XML structure. In both cases, you parse the XML with the Get data from XML input step. For specifying the fields to read, you use XPath notation. When the XML is very big or complex, there is an alternative step:XML Input Stream (StAX).

Similarly, you can parse JSON structures with the JSON Input step. For specifying the fields in this case, you use JSONPath notation.

Also, you can parse both XML and JSON structures with JavaScript or Java code, by using the Modified Java Script Value step or the User Defined Java Class step...

Combining different sources into a single dataset

In this chapter, you have been getting data from different kinds of sources. In all cases, the source of the data was unique; for example, a plain file or the result of a single query. However, what if we had more than one source, and we wanted to combine them and use them as a single dataset? In this section, you will learn how to deal with this very common situation.

Manipulating the metadata

Let's look at the first exercise again, where we read a file containing surveys. On that occasion, we read all of the information in the file. Now, suppose that we are interested in just a few fields: room_id, room_type, neighborhood, overall_satisfaction, accommodates, and price. Also, we want to rename some fields, and we want them in a different order.

There is a very easy way to do this, as follows:

Open the transformation created in the first exercise and save it under a different name. You can do so from Main Menu or Main Toolbar.
From the Transform...

Summary

In this chapter, you learned how to get data from different sources, converting it to PDI datasets.

First, you learned how to get data from plain files stored in your local system. You also had the opportunity to configure PDI to access compressed files and files stored in a Google Drive.

Having worked with files, you started to interact with relational databases. You learned how to configure a connection to a database, how to explore its content, and how to get data from it.

Finally, you were presented with sources other than plain files and databases, including XML and JSON sources and sources of system-related information.

Having explored the different options for getting external information, you learned how to combine two or more datasets into a single one. This task will be used not only for extracting and combining external sources but in many situations in your daily PDI work.

Now that you have the data, you are ready to transform it. You will learn how to do so in the next chapter...

The rest of the chapter is locked

You have been reading a chapter from

Pentaho Data Integration Quick Start Guide

Published in: Aug 2018Publisher: PacktISBN-13: 9781789343328

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

undefined

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at $15.99/month. Cancel anytime

Author (1)

María Carina Roldán

María Carina Roldán was born in Argentina and has a bachelor's degree in computer science. She started working with Pentaho back in 2006. She spent all these years developing BI solutions, mainly as an ETL specialist, and working for different companies around the world. Currently, she lives in Buenos Aires and works as an independent consultant. Carina is the author of Learning Pentaho Data Integration 8 CE, published by Packt in December 2017. She has also authored other books on Pentaho, all of them published by Packt.
Read more about María Carina Roldán

Personalised recommendations for you

Based on your interests and search pattern

Et al.

Ever wonder why speech recognition systems don't understand the Scottish accent, or what would happen if an astronaut only ate mac 'n' cheese, or other spurious reflections you'd have at a bar? We did, then collated those deliberations into absurd research articles with fake figures and methodologies inspired by even more fictionally absurd studies.

BookAug 2023230 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages4

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages1

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Mastering Tableau 2023

This book is a comprehensive resource to mastering your Tableau skills and becoming a BI expert. As you progress, you will learn how to build advanced dashboards and improve your storytelling to derive key business insight, as well as make you well-versed with advanced functionalities of Tableau in the business intelligence domain.

BookAug 2023684 pages

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages5

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages2

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

BookOct 2023636 pages5

Modern Data Architecture on AWS

Every organization wants an agile, performant, and cost-effective data platform that meets all their current and future business needs. Purpose-built AWS analytics services and their features play a big part in building such a modern data platform. This book brings to you all the design and architectural patterns that’ll help you achieve this goal.

BookAug 2023420 pages5

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

BookDec 2023240 pages

TinyML Cookbook

With over 70 project-based recipes, the TinyML Cookbook is a practical guide that will help you to get the most out of your microcontrollers. It provides a comprehensive understanding of the theoretical foundations while giving you hands-on experience training ML models for deployment on Arduino Nano 33 BLE Sense, Raspberry Pi Pico, and SparkFun RedBoard Artemis Nano microcontrollers.

BookNov 2023664 pages