
Chapter 2. Data Preprocessing

Data preprocessing usually takes a lot of time to set up, because you have to take care of a lot of different formats and sources. In this chapter, we will introduce the basic options not only for reading and generating data but also for reshaping it. Changing values is also a common task, and we will cover that too.

It is a good practice to keep checking certain constraints, especially if you have volatile input sources; this matters in KNIME too. Finally, we will go through an example workflow from import to preprocessing. In this chapter, we will cover the following topics:

  • Data import

  • From a database

    • From files

    • From web services

  • Regular expressions

  • Transforming tables

  • Transforming values

  • Generating data

  • Constraints

  • Case studies

Importing data


Your data can come from multiple sources, such as databases, the Internet/intranet, or files. This section gives a short introduction to the various options.

Importing data from a database

In this section, we will use the Java DB (http://www.oracle.com/technetwork/java/javadb/index.html) to create a local database because it is supported by Oracle, bundled with JDKs, cross-platform, and easy to set up. The database we use is described on Eclipse's BIRT Sample Database page (http://www.eclipse.org/birt/phoenix/db/#schema).

Starting Java DB

Once you have Java DB installed (either by unzipping the binary distribution from Derby (http://db.apache.org/derby/derby_downloads.html) or by locating it within your JDK), you should also download the BirtSample.jar file from this book's website (originally from http://mirror-fpt-telecom.fpt.net/eclipse/birt/update-site/3.7-interim/plugins/org.eclipse.birt.report.data.oda.sampledb_3.7.2.v20120213.jar.pack.gz). Unzip the content to the database server's install folder...
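
Once the server is running, you can sanity-check the database with a few lines of JDBC before wiring it into KNIME. The following is only a sketch: it assumes the Derby network server is listening on the default port 1527, that the unpacked database directory is named BirtSample, and that derbyclient.jar is on the classpath; the CLASSICMODELS.CUSTOMERS table and its CUSTOMERNAME column are taken from the BIRT sample schema page linked earlier.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class BirtSampleProbe {
        public static void main(String[] args) throws Exception {
            // Assumes the Derby network server runs locally on port 1527
            // and the unpacked database directory is named "BirtSample".
            String url = "jdbc:derby://localhost:1527/BirtSample";
            try (Connection conn = DriverManager.getConnection(url);
                 Statement st = conn.createStatement();
                 // Table and column names as given on the BIRT schema page
                 ResultSet rs = st.executeQuery(
                         "SELECT CUSTOMERNAME FROM CLASSICMODELS.CUSTOMERS")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }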

Regular expressions


Regular expressions are excellent for simpler parsing, replacement, or splitting tasks. We will give a short introduction to them and show some examples, which will allow you to get a better idea. At the end of this section, we will suggest further reading.

Basic syntax

Usually, when you write plain text as a pattern, that text is matched literally; for example, apple or pear will match the highlighted parts of the following sentence: "Apple stores do not sell apple or pear."

Patterns are case sensitive by default, so the pattern apple will not match the first word of the sentence (or the company name).

There are special characters that need to be escaped when you want to match them literally: ., [, ], (, ), {, }, -, ^, $, \ (well, some of these only in certain positions). To escape them, prefix them with \, which results in the following patterns: \., \[, \], \(, \), \{, \}, \-, \^, \$, \\.
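
KNIME's regular expressions follow Java's java.util.regex syntax, so the rules above can be tried out in a few self-contained lines; the strings below simply reuse the examples from this section.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexBasics {
        public static void main(String[] args) {
            String sentence = "Apple stores do not sell apple or pear.";
            // Plain text is matched case-sensitively by default...
            System.out.println(sentence.matches(".*apple.*")); // true
            System.out.println("Apple".matches("apple"));      // false
            // ...unless the CASE_INSENSITIVE flag is set.
            Pattern p = Pattern.compile("apple|pear", Pattern.CASE_INSENSITIVE);
            Matcher m = p.matcher(sentence);
            while (m.find()) {
                System.out.println(m.group() + " at index " + m.start());
            }
            // Special characters such as '.' must be escaped to match literally:
            System.out.println("3.14".matches("3\\.14")); // true
            System.out.println("3x14".matches("3\\.14")); // false
        }
    }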

When you do not want an exact match of characters, you...

Transforming the shape


There are multiple ways to change the shape of the data. Usually, it is just projection or filtering, but there are more complex options too.

Filtering rows

For row filters, the usual naming convention is used; that is, node names ending in "Filter" produce only a single table as a result, while the "Splitter" nodes generate two tables: one for the matching and one for the non-matching rows.

For single-column conditions, the Row Filter (and Row Splitter) node can be used to select rows based on a column value in a range, a regular expression, or missing values. It is also possible either to keep the matching rows or to filter them out. For row IDs, you can only use regular expressions.
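
The Filter/Splitter naming convention is easy to picture in plain Java. The toy sketch below (rows reduced to strings, the pattern made up) collects matching rows in one list and the rest in another; that is the two-output Splitter behavior, while a Filter would keep only one of the two lists.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.regex.Pattern;

    public class RowSplitterSketch {
        public static void main(String[] args) {
            List<String> rows = Arrays.asList("apple pie", "pear", "apple", "plum");
            Pattern pattern = Pattern.compile("apple.*"); // illustrative condition
            List<String> matches = new ArrayList<String>();
            List<String> rest = new ArrayList<String>();
            for (String row : rows) {
                // Splitter: route each row to one of the two output tables
                (pattern.matcher(row).matches() ? matches : rest).add(row);
            }
            System.out.println("matching rows:     " + matches);
            System.out.println("non-matching rows: " + rest);
        }
    }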

The rows can also be filtered by the (one-based) row index.

The Nominal Value Row Filter node gives a nice user interface when the possible values of textual columns are known at configuration time, so you do not have to create complex regular expressions to match only those exact values.

There...

Transforming values


Sometimes the data that you have requires further processing; that is, not just moving values around but also changing some of them.

Generic transformations

A quite flexible node is Rule Engine, which creates or replaces a column based on certain rules involving other columns. Rules can contain logical and relational operators for text, check for inclusion in a certain set of values (IN), or do limited pattern matching (LIKE, as in SQL). The result can be a constant, a column's value, or a flow variable's value. It can also handle missing values.
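
For illustration, rules are written one per line as a condition, the => symbol, and an outcome, and they are evaluated from top to bottom until one fires. The column names below are invented; consult the node's description for the exact grammar:

    $Age$ > 65 => "senior"
    $Country$ IN ("US", "CA") AND $Sales$ > 1000 => "priority"
    $Name$ LIKE "Jo*" => "J-customer"
    MISSING $Age$ => "unknown"
    TRUE => "other"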

When you want to fill in the metadata, you should use the Domain Calculator node. With its help, you can create nominal columns from textual (String) columns.

Java snippets

The most generic cell transformation nodes are the Java snippet nodes (Java Snippet and Java Snippet (Simple)). They allow you to use third-party libraries and custom code to transform a row's values or append new columns with those values...
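
As a rough illustration, the body of a Java Snippet could look like the following. The node generates the surrounding class for you, binding input columns to fields with a c_ prefix and appended output columns to fields with an out_ prefix; the column names here are invented, and the exact field names depend on your configuration:

    // Input column "name" is bound to c_name; the appended output
    // column "nameUpper" is bound to out_nameUpper.
    // A missing String value arrives as null, so guard against it.
    String trimmed = (c_name == null) ? "" : c_name.trim();
    out_nameUpper = trimmed.toUpperCase();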

Data generation


There is a KNIME Labs plugin named KNIME Datageneration (http://tech.knime.org/datageneration). It supports generating values from different distributions, added as new columns to existing rows:

  • Random Number Assigner: It supports uniform distribution

  • Gaussian Distributed Assigner: It supports Gaussian distribution

  • Beta Distributed Assigner: It supports beta distribution

  • Gamma Distributed Assigner: It supports gamma distribution

To generate rows with numeric content, the most obvious node is Data Generator. It generates clusters of normally distributed data in various dimensions, with different cluster centers on the [0,1] interval, and it also generates the cluster labels.
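
If you need similar values outside KNIME, or inside a Java Snippet, the uniform and Gaussian cases are one-liners in plain Java; the seed, mean, and deviation below are arbitrary, and beta or gamma sampling would require extra code or a library such as Apache Commons Math:

    import java.util.Random;

    public class DistributionSamples {
        public static void main(String[] args) {
            Random rnd = new Random(42); // fixed seed for reproducibility
            // Uniform on [0, 1), as in the Random Number Assigner
            // (rescale for other ranges):
            double uniform = rnd.nextDouble();
            // Gaussian with mean 5 and standard deviation 2, as in the
            // Gaussian Distributed Assigner:
            double gaussian = 5.0 + 2.0 * rnd.nextGaussian();
            System.out.println(uniform + " " + gaussian);
        }
    }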

To generate empty rows for existing tables, the Add Empty Rows node gives you options. You might also want to create a table specification before you add (new or additional) empty rows. This can be done using the Create Table Structure or Table Creator nodes. Both are manual, but if you have a tab...

Constraints


You can seldom trust the data you have: there can be network problems during import, the program that generated it may have been wrongly parameterized or given invalid input, or the device used to collect the data may have been operated outside its operating conditions. For these reasons, it is good practice to identify constraints and check them after import or after more complex transformations. You should also check user input, and if it might cause hard-to-discover problems in later phases, report problems as soon as you can.
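
The idea behind such checks is simply to fail fast. As a toy sketch (the bounds are invented), a range constraint on an imported value could look like this:

    public class RangeCheck {
        // Reject a value that violates a known constraint as early as possible.
        static double checked(double value, double min, double max) {
            if (value < min || value > max) {
                throw new IllegalArgumentException(
                        "value " + value + " outside [" + min + ", " + max + "]");
            }
            return value;
        }

        public static void main(String[] args) {
            System.out.println(checked(0.7, 0.0, 1.0)); // passes
            System.out.println(checked(1.7, 0.0, 1.0)); // throws
        }
    }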

The Flow Control/Switches nodes can be used to enable parts of the workflow selectively (this is useful if the constraint check is not always required, if it is too time consuming to be on by default, or to try correcting the wrong data), but the loop-related nodes (Flow Control/Loop Support) are also useful when multiple columns should be tested, and they can handle complex conditions.

In the preceding screenshot, a flow variable comes from outside of the meta node,...

Loops


Doing the same thing multiple times might look like a bad idea, but usually we do something slightly different in each iteration; with loops, we can factor out the repetition, and our workflows become easier to reuse.

A few notes about the loops:

  • The flow variables that loops generate are read-only; when you replace them, you do not modify them (they are handled internally), you just hide them from further processing

  • The loops can be nested, so it is possible to repeat the same work quite a lot of times

The simple Counting Loop Start node just feeds the same input table to the loop (as many times as specified), each time increasing the currentIteration flow variable.

When you would like to iterate over an interval other than [0, maxIteration-1], or when the preferred increment is not one, you should consider using the Interval Loop Start node instead of the counting one.
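
As an analogy only (this is not the KNIME API), the two loop-start nodes correspond to the two classic for loop shapes, with currentIteration playing the role of the loop counter flow variable:

    public class LoopAnalogy {
        public static void main(String[] args) {
            int maxIteration = 3;
            // Counting Loop Start: the same input is processed maxIteration
            // times; currentIteration is exposed as a flow variable.
            for (int currentIteration = 0; currentIteration < maxIteration; currentIteration++) {
                System.out.println("pass " + currentIteration);
            }
            // Interval Loop Start: iterate over [from, to] with a chosen step.
            for (double loopValue = 0.5; loopValue <= 2.0; loopValue += 0.5) {
                System.out.println("value " + loopValue);
            }
        }
    }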

Iterating through a table and splitting the input table into smaller chunks can be useful when it is too large to handle with the workflow...

Workflow customization


It is highly recommended to install the KNIME Nodes to create KNIME Quick Forms extension from the standard KNIME update site, because its nodes allow you to create configuration points for whole parts of your workflow. This way, your users can adapt it to their needs more easily or just experiment with different parameters.

Here is a short introduction to the Quick Form nodes. First, we will group them by the kind of information they generate:

  • Boolean Input: It generates logical information

  • Integer Input: It generates an integer number

  • Double Input: It generates a double number

  • Date (String) Input: It generates a date (as text)

  • Column Filter QuickForm (multiple column names; results in an empty table) and Column Selection QuickForm (single column name): These generate column names

  • String Input (single line), String Radio Buttons, Single Selection Input QuickForm (single choice from enumerated possible values), Multiple Selection Input QuickForm (multiple choice from enumerated...

Case study – finding min-max in the next n rows


In the next few sections, we will introduce some problems and our solutions to them using KNIME.

Sometimes you are fine with the moving average for date type values, but in certain situations, you need the range of values within a window. In the workflow available in the sliding_minmax.zip file, we will do exactly this. We assume an equidistant distribution of date values in the rows; you can try to generalize the workflow to remove this restriction.

In the preceding screenshot, first (after generating some sample data) we add an ID based on the row index, then shift the content by the value specified in the Integer Input node, and finally combine the tables to find the min and max values.

The main idea we use is described in the following steps: create a new table for each position in the sliding window (each shifted according to its position), combine these tables using an identifier, and finally use the GroupBy node to select the values. Alternatively...
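
Independent of how the workflow is wired, the computation itself is small. The following sketch (plain arrays standing in for the table, the window length n chosen arbitrarily) computes, for each row, the min and max of the next n rows, which is what the shifted-tables-plus-GroupBy construction produces:

    import java.util.Arrays;

    public class SlidingMinMax {
        // For each index i, the min and max of values[i .. i+n-1].
        static double[][] slidingMinMax(double[] values, int n) {
            double[][] result = new double[values.length][2];
            for (int i = 0; i < values.length; i++) {
                double min = Double.POSITIVE_INFINITY;
                double max = Double.NEGATIVE_INFINITY;
                for (int j = i; j < Math.min(i + n, values.length); j++) {
                    min = Math.min(min, values[j]);
                    max = Math.max(max, values[j]);
                }
                result[i][0] = min;
                result[i][1] = max;
            }
            return result;
        }

        public static void main(String[] args) {
            double[] values = {3, 1, 4, 1, 5, 9, 2, 6};
            for (double[] minMax : slidingMinMax(values, 3)) {
                System.out.println(Arrays.toString(minMax));
            }
        }
    }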

Case study – ranks within groups


In this case, we will compute ranks (based on a certain order) within groups. This is a much easier task, but it can be very useful if you want to select outliers without prior knowledge to define cut-off points. It can also be useful for summarizing historical data (for example, finding the three or five top hits that led the sales list the longest in different genres). There is also a simplification when we do not need the rank, just the extreme values. Still, certain algorithms can use the rank values for better predictions, because we humans are biased toward the best options. For example, in a 100-minute race, the difference between the first and the fifth driver might hypothetically be one minute; that amounts to one percent, quite a small difference, although the difference in prizes and fame is much larger.

The example workflow is in the GroupRanks.zip file.

First, we generate some sample data with the Data Generator node, just like before...
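
The essence of the group-rank computation, detached from the workflow (the group keys and values below are invented), fits in a few lines of Java; ties receive the same rank here:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class GroupRanks {
        public static void main(String[] args) {
            // group -> values; in the workflow these come from the data table
            Map<String, List<Double>> groups = new LinkedHashMap<String, List<Double>>();
            groups.put("A", Arrays.asList(3.0, 1.0, 2.0));
            groups.put("B", Arrays.asList(5.0, 4.0));
            for (Map.Entry<String, List<Double>> e : groups.entrySet()) {
                List<Double> sorted = new ArrayList<Double>(e.getValue());
                Collections.sort(sorted, Collections.reverseOrder());
                // rank 1 belongs to the largest value within the group
                for (Double v : e.getValue()) {
                    int rank = sorted.indexOf(v) + 1;
                    System.out.println(e.getKey() + "\t" + v + "\trank=" + rank);
                }
            }
        }
    }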

Summary


In this chapter, we have constructed KNIME workflows to work with various data sources (generated, Internet, files, and databases). We introduced the most important nodes for transforming and preprocessing the data we have, and we combined this knowledge to implement solutions to different problems. By now, you should have an idea of how to construct KNIME workflows and how to use flow variables and loops.
