Data Analytics

by Alex Meadows | November 2013 | Cookbooks

In this article by Alex Meadows, the author of Pentaho Data Integration Cookbook - Second Edition, we will cover the following topics:

  • Reading data from a SAS datafile

  • Studying data via stream statistics

  • Building a random data sample for Weka

Introduction

Data Analytics is the art of taking data and deriving information from it in order to make informed decisions. A large part of building and validating datasets for the decision making process is data integration—the moving, cleansing, and transformation of data from the source to a target. This article will focus on some of the tools that take Kettle beyond the normal data processing capabilities and integrate processes into analytical tools.

Reading data from a SAS datafile

SAS is one of the leading analytics suites, providing robust commercial tools for decision making in many different fields. Kettle can read files written in SAS' specialized data format known as sas7bdat using a new (since Version 4.3) input step called SAS Input. While SAS does support other format types (such as CSV and Excel), sas7bdat is a format most similar to other analytics packages' special formats (such as Weka's ARFF file format). This recipe will show you how to do it.

Why read a SAS file?

There are two main reasons for wanting to read a SAS file as part of a Kettle process. The first is that a dataset created by a SAS program is already in place, but the output of this process is used elsewhere in other Business Intelligence solutions (for instance, using the output for integration into reports, visualizations, or other analytic tools). The second is when there is already a standard library of business logic and rules built in Kettle that the dataset needs to run through before it can be used.

Getting ready

To be able to use the SAS Input step, a sas7bdat file will be required. The Centers for Disease Control and Prevention provide some sample datasets as part of the NHANES Dietary dataset. Their tutorial datasets can be found at http://www.cdc.gov/nchs/tutorials/dietary/downloads/downloads.htm. We will be using the calcmilk.sas7bdat dataset for this recipe.

How to do it...

Perform the following steps to read in the calcmilk.sas7bdat dataset:

  1. Open Spoon and create a new transformation.

  2. From the input folder of the Design palette, bring over a Get File Names step.

  3. Open the Get File Names step. Click on the Browse button, find the calcmilk.sas7bdat file downloaded for this recipe, and click on OK.

  4. From the input folder of the Design palette, bring over a SAS Input step. Create a hop from the Get File Names step to the SAS Input step.

  5. Open the SAS Input step. For the Field in the input to use as filename field, select the Filename field from the dropdown.

  6. Click on Get Fields. Select the calcmilk.sas7bdat file and click on OK.

  7. If you are using Version 4.4 of Kettle, you will receive a java.lang.NoClassDefFoundError message. There is a workaround, which can be found on the Pentaho wiki at http://wiki.pentaho.com/display/EAI/SAS+Input.

  8. To clean up the stream and keep only the calcmilk data, add a Select Values step and create a hop from the SAS Input step to the Select Values step. Open the Select Values step and switch to the Remove tab. Select the fields generated from the Get File Names step (filename, short_filename, path, and so on). Click on OK to close the step.

  9. Preview the Select Values step. The data from the SAS Input step should appear in a data grid, as shown in the following screenshot:

How it works...

The SAS Input step takes advantage of Kasper Sørensen's Sassy Reader project (http://sassyreader.eobjects.org). Sassy is a Java library used to read datasets in the sas7bdat format and is derived from the R package created by Matt Shotwell (https://github.com/BioStatMatt/sas7bdat). Before those projects, it was not possible to read the proprietary file format outside of SAS' own tools.

The SAS Input step requires the processed filenames to be provided from another step (like the Get File Names step). Also of note, while the sas7bdat format only has two format types (strings and numbers), PDI is able to convert fields to any of the built-in formats (dates, integers, and so on).
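
Outside of Kettle, the same file can be inspected with a few lines of Python, which makes for a handy sanity check on what the SAS Input step should return. This is only a sketch; it assumes the pandas library is installed and that calcmilk.sas7bdat sits in the working directory:

# Quick inspection of a sas7bdat file outside of Kettle.
# Assumes pandas is installed and calcmilk.sas7bdat is in the working directory.
import pandas as pd

# pandas reads the sas7bdat format directly; columns arrive as either strings
# or numbers, matching the two native SAS types mentioned above.
df = pd.read_sas("calcmilk.sas7bdat", format="sas7bdat")

print(df.dtypes)    # column names and inferred types
print(df.head())    # first few rows, comparable to the Select Values preview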


Studying data via stream statistics

While Kettle's forte is extracting, manipulating, and loading data, there is an entire set of tools built for generating statistics and analytic-style data from the data stream. Kettle treats the data worked on in transformations as a stream flowing from an input to an output; the steps discussed in this recipe gather statistics about that stream, allowing for even more insight into your data.

Getting ready

This recipe will not be a single large process, but will be made up of smaller recipes around the same subject. We will be using the Baseball salary dataset that can be found on the book's website or from Lahman's Baseball Archive website, found at http://www.seanlahman.com/baseball-archive/statistics/. The code for this recipe can also be found on the book's website.

The recipe will be broken into smaller recipes that focus on three steps: Analytic Query, Group by, and Univariate Statistics. These steps will allow us to gain some insight into the baseball players' salaries, such as the salary change from one contract to the next, frequency of being traded, and so on.

How to do it...

Perform the following steps to learn how to use the Analytic Query step:

  1. Create a new transformation and add a Text file input step from the palette to the canvas.

  2. Have the Text file input step point to the Salaries.csv file. On the Content tab, be sure to change the Separator from ; to ,. On the Fields tab, use the Get Fields button to get the fields of the file.

  3. Click on OK to close the Text file input step.

  4. Add a Sort rows step from the palette to the canvas. Create a hop from the Text file input step to the Sort rows step.

  5. The data needs to be sorted by playerID in ascending order and yearID in descending order. Your Sort rows step should look similar to the following:

  6. From the Statistics folder, select the Analytic Query step and add it to the canvas. Create a hop from the Sort rows step to the Analytic Query step.

  7. For the Group field, select playerID. Fill in the Analytic Functions grid as follows:

  8. Add a Calculator step and create a hop from the Analytic Query step to the Calculator step.

  9. Open the Calculator step. Create a new field with the following criteria:

  10. Finally, preview the Calculator step. You should receive an output similar to the following:

The salary information now provides a little more detail and can show how much a player gained (or lost) over the course of their career. Next, let's look at another step that can reveal even more detail about the baseball player salary dataset: Univariate Statistics.

Perform the following steps to learn how to use the Univariate Statistics step:

  1. Create a new transformation and add a Text file input step to the canvas.

  2. Have the Text file input step point to the Salaries.csv file. On the Content tab, be sure to change the Separator from ; to ,. On the Fields tab, use the Get Fields button to get the fields of the file. Click on OK to close the Text file input step.

  3. Bring a Univariate Statistics step from the palette to the canvas and create a hop from the Text file input step to the Univariate Statistics step.

  4. Open the Univariate Statistics step. For the Input field, select salary. Set the value of N, Mean, Std dev, Min, Max, and Median to true.

  5. Click on OK to close the Univariate Statistics step and then preview the step. A row showing the various statistics around the salary will be displayed in the preview:

The data stream is processed, returning the salary statistics for the entire dataset. Now, as the last part of this recipe, let's explore the Group by step.

Perform the following steps to learn how to use the Group by step:

  1. Create a new transformation and add a Text file input step to the canvas.

  2. Have the Text file input step point to the Salaries.csv file. On the Content tab, be sure to change the Separator from ; to ,. On the Fields tab, use the Get Fields button to get the fields of the file. Click on OK to close the Text file input step.

  3. Add a Sort rows step to the canvas and create a hop from the Text file input step to the Sort rows step.

  4. Open the Sort rows step and sort the data on playerID in ascending order.

  5. Add a Group by step to the canvas and create a hop from the Sort rows step to the Group by step.

  6. Open the Group by step. For the Group field, select playerID. Fill in the Aggregates data grid, as shown in the following screenshot:

  7. Click on OK to close the Group by step and preview the data. The data stream will be grouped by the individual players and show salary statistics per player:

How it works...

This recipe covered three different ways to find out more information about the data being processed, each collecting statistics about the data that can be reported on without having to be recalculated every time a report or analysis is run. For each of the steps, there are two things to consider:

  • The data must be sorted based on the query requirements

  • The original data will not continue through the stream after being processed by these steps

The Analytic Query step provides the ability to compare multiple records across a data stream, which has historically been complicated to do with just SQL. Quite often, comparisons need to be made within a group of data, usually in the form of tracking changes from one record or period to the next. For our baseball salary dataset, we looked at each player from season to season and how their salary changed.
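
If you would like to double-check the Analytic Query result outside of Spoon, the following pandas sketch mimics the Sort rows, Analytic Query, and Calculator chain. The column names playerID, yearID, and salary come from Salaries.csv; the exact lag direction depends on how the Analytic Functions grid was filled in, so treat this as an approximation rather than the recipe's exact logic:

# A rough pandas equivalent of Sort rows -> Analytic Query (LAG) -> Calculator.
# Assumes Salaries.csv is in the working directory with playerID, yearID,
# and salary columns.
import pandas as pd

salaries = pd.read_csv("Salaries.csv")

# Sort by playerID ascending and yearID descending, as in the Sort rows step.
salaries = salaries.sort_values(["playerID", "yearID"], ascending=[True, False])

# Pull the previous row's salary within each player group (the Analytic Query step),
# then subtract to get the change from one record to the next (the Calculator step).
salaries["prev_salary"] = salaries.groupby("playerID")["salary"].shift(1)
salaries["salary_change"] = salaries["salary"] - salaries["prev_salary"]

print(salaries.head(10))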

The Univariate Statistics step provides common statistics for the data stream being analyzed. These values can be used for data validation, for comparisons between data loads, and for reporting. In the baseball salary dataset, we used the Univariate Statistics step to see the metrics around salary (specifically the mean, min, and max salary numbers for all the records).
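
As a cross-check, the same whole-dataset figures can be computed with pandas. The sketch below assumes Salaries.csv is in the working directory and has a salary column:

# The Univariate Statistics output in miniature: N, mean, standard deviation,
# min, max, and median of salary across the entire file.
import pandas as pd

salaries = pd.read_csv("Salaries.csv")
stats = salaries["salary"].agg(["count", "mean", "std", "min", "max", "median"])
print(stats)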

The Group by step not only provides the same types of statistics as the Univariate Statistics step, but also allows for grouping the data together. For the baseball dataset, we used the Group by step to see the metrics around the mean, min, and max salary numbers for each player.
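
The per-player version of those numbers is a short group-by in pandas (again assuming Salaries.csv with playerID and salary columns):

# The Group by step's aggregates in miniature: mean, min, and max salary per player.
import pandas as pd

salaries = pd.read_csv("Salaries.csv")
per_player = (
    salaries.groupby("playerID")["salary"]
    .agg(["mean", "min", "max"])
    .reset_index()
)
print(per_player.head())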


Building a random data sample for Weka

Weka is another open source tool, officially supported by Pentaho, that focuses on data mining. Like its cousins R and RapidMiner, Weka provides a library of statistical analysis tools that can be integrated into complex decision-making systems. In this recipe, we will go over how to build a random dataset for Weka using Kettle.

Getting ready

We will be using the baseball player salaries data that can be found on the book's website or from Lahman's Baseball Archive website, found at http://www.seanlahman.com/baseball-archive/statistics/. The code for this recipe can also be found on the book's website.

This recipe also takes advantage of the ARFF Output plugin. This is available either via the Marketplace (for Kettle 5 and higher) or from the wiki at http://wiki.pentaho.com/display/EAI/List+of+Available+Pentaho+Data+Integration+Plug-Ins.

How to do it...

Perform the following steps to build a random data sample for Weka:

  1. Create a new transformation and add a Text file input step to the canvas.

  2. Have the Text file input step point to the Salaries.csv file. On the Content tab, be sure to change the Separator from ; to ,. On the Fields tab, use the Get Fields button to get the fields of the file. Click on OK to close the Text file input step.

  3. Add a Reservoir Sampling step from the Statistics folder in the palette to the canvas. Create a hop from the Text file input step to the Reservoir Sampling step.

  4. Open the Reservoir Sampling step. Change the value of the Sample size (rows) field to 1000. Click on OK to close the step.

  5. Add an ARFF Output step to the canvas. Create a hop from the Reservoir Sampling step to the ARFF Output step.

  6. Open the ARFF Output step. For the File name field, call the file baseball_salaries.

  7. Switch to the Content tab. Ensure the Format matches the environment that you are working in (DOS for Windows, Unix for *nix).

  8. Switch to the Fields tab. Use the Get Fields button to get the data stream fields into the Fields data grid. The step will make a best guess at the ARFF type for each data element. Click on OK to close the step.

  9. Run the transformation. An ARFF file will be generated and can be used to work with the data within Weka.

How it works...

This recipe utilizes two steps. The first, Reservoir Sampling, can be used by any process that only needs a random sample of the data; the second, ARFF Output, transforms the dataset into the standard format for Weka.

Reservoir Sampling takes large datasets and randomly selects records to create a smaller representative sample of the data. The two options in the step, sample size and random seed, control how big the sample set should get and how the records are randomly selected. For more details on the step, check out the Pentaho wiki at http://wiki.pentaho.com/display/DATAMINING/Using+the+Reservoir+Sampling+Plugin.
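
The idea behind the step is classic reservoir sampling: keep a fixed-size, uniformly random sample from a stream without knowing its length in advance. The following is a minimal sketch of that algorithm, not the PDI step's actual implementation; its sample_size and seed arguments mirror the step's two options:

import random

def reservoir_sample(rows, sample_size, seed=1):
    """Keep a uniformly random sample of sample_size rows from an iterable."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < sample_size:
            # Fill the reservoir with the first sample_size rows.
            reservoir.append(row)
        else:
            # Replace an existing entry with decreasing probability so that
            # every row seen so far has an equal chance of being kept.
            j = rng.randint(0, i)
            if j < sample_size:
                reservoir[j] = row
    return reservoir

# Example: pull 10 rows at random from a stream of 1,000.
print(reservoir_sample(range(1000), 10))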

The ARFF Output step takes the data stream and stores the data in the standard format that Weka uses to process data. The first part of the file is the header, which provides the field details (name, type, and so on) and can also store the data source details (who created it, when the dataset was created, and so on). The second part fits the typical comma-separated values format, with each record's fields separated by a comma. To learn more about the format, check out the Weka wiki at http://weka.wikispaces.com/ARFF+(stable+version).
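
To make that two-part layout concrete, here is a tiny hand-written ARFF file in the same spirit as the one the ARFF Output step produces. The attribute names follow a few of the salary dataset's columns, but the relation name and the data rows are made-up placeholders:

@relation baseball_salaries

@attribute yearID numeric
@attribute teamID string
@attribute playerID string
@attribute salary numeric

@data
1985,ATL,exampleplayer01,550000
1986,NYA,exampleplayer02,750000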

There's more...

There is another Kettle plugin that will actually take advantage of a model built in Weka and return the results for further processing within Kettle. The step is called Weka Scoring and is a great place to start integrating machine learning into normal data integration processes. To learn more about the plugin, check out the wiki at http://wiki.pentaho.com/display/DATAMINING/Using+the+Weka+Scoring+Plugin.

There is a subforum dedicated to working with Pentaho and Weka on the Pentaho forums: http://forums.pentaho.com/forumdisplay.php?81-Pentaho-Data-Mining-WEKA.

To learn more about Weka, check out the Weka website for tutorials and other references at http://www.cs.waikato.ac.nz/ml/weka.

Summary

This article showed you how to work with the various analytical tools built into Kettle, focusing on statistics gathering steps and building datasets for Weka.

About the Author


Alex Meadows

Alex Meadows has worked with open source Business Intelligence solutions for nearly 10 years and has worked in various industries such as plastics manufacturing, social and e-mail marketing, and most recently with software at Red Hat, Inc. He has been very active in Pentaho and other open source communities to learn, share, and help newcomers with the best practices in BI, analytics, and data management. He received his Bachelor's degree in Business Administration from Chowan University in Murfreesboro, North Carolina, and his Master's degree in Business Intelligence from St. Joseph's University in Philadelphia, Pennsylvania.
