You're reading from  KNIME Essentials

Product type: Book
Published in: Oct 2013
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781849699211
Edition: 1st Edition
Author: Gábor Bakos

Gábor Bakos is a programmer and a mathematician with a few years of experience with KNIME and KNIME node development (HiTS nodes and RapidMiner integration for KNIME). At Trinity College, Dublin, he helped a research group with his data analysis skills (and had the opportunity to improve them) and with new KNIME node development. When he worked for evopro Kft. and Scriptum Informatika Zrt., he also worked on various data analysis software products. He currently works for his own company, Mind Eratosthenes Kft. (www.mind-era.com), where he develops the RapidMiner integration for KNIME (tech.knime.org/community/rapidminer-integration), among other things.

Chapter 3. Data Exploration

In this chapter, we will go through the main functions of KNIME visualization (except reporting) and other techniques to explore the data you have. This can be helpful during preprocessing, but you can also use visualization to check results or to see how well the computed models fit the test/validation data. The topics covered in this chapter are as follows:

  • Statistics

  • Distance matrix

  • Visual properties

  • KNIME views and HiLiting

  • JFreeChart nodes

  • Some third party visualization options

  • Tips with HiLiting

  • Visualizing models

Computing statistics


When you want to explore your data, it is usually a good idea to compute some statistics about it so that you can spot obviously wrong data (for example, when a value that should be positive shows up as a negative minimum, it is suspicious).

Most of the nodes require you to not have NaN values within the data to be analyzed. You can remove them with the value modification techniques presented in the previous chapter, or by filtering the rows, also discussed in the previous chapter.
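To make the row-filtering idea concrete outside KNIME, here is a minimal plain-Python sketch. It is not a KNIME API; the function name `drop_rows_with_nan` and the sample rows are made up for illustration.

```python
import math

def drop_rows_with_nan(rows, columns):
    """Keep only the rows whose values in `columns` are not NaN."""
    return [
        row for row in rows
        if not any(isinstance(row[c], float) and math.isnan(row[c]) for c in columns)
    ]

# Hypothetical sample rows; the second one has a missing measurement.
rows = [
    {"sepal_length": 5.1, "sepal_width": 3.5},
    {"sepal_length": float("nan"), "sepal_width": 3.0},
    {"sepal_length": 4.7, "sepal_width": 3.2},
]

clean = drop_rows_with_nan(rows, ["sepal_length", "sepal_width"])
print(len(clean), "of", len(rows), "rows kept")
```

Replacing the NaN values instead of dropping the rows would follow the same pattern, with a substitute value written into the row.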

The minimal and maximal values can be checked in the port view's Spec Columns tab. This can already be used to spot certain kinds of problems.

For statistics within groups, we have the good old GroupBy node. That allows you to aggregate using the functions described on the Description tab of the configuration dialog.
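As a rough illustration of what grouped aggregation computes, the following plain-Python sketch groups rows by a nominal column and reports the minimum, maximum, and mean of a numeric column. It is only an analogy for the GroupBy node, not its implementation; the column names and values are invented for the example.

```python
from collections import defaultdict
from statistics import mean

def group_stats(rows, group_col, value_col):
    """Aggregate value_col per distinct value of group_col."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row[value_col])
    return {
        key: {"min": min(vals), "max": max(vals), "mean": mean(vals)}
        for key, vals in groups.items()
    }

# Invented values; one negative length is planted as "obviously wrong" data.
rows = [
    {"class": "setosa", "petal_length": 1.4},
    {"class": "setosa", "petal_length": -1.3},
    {"class": "virginica", "petal_length": 6.0},
    {"class": "virginica", "petal_length": 5.0},
]

stats = group_stats(rows, "class", "petal_length")
# A length can never be negative, so a negative group minimum is suspicious.
suspicious = [key for key, s in stats.items() if s["min"] < 0]
print(suspicious)
```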

When you do not need the grouping, you can use the Statistics node with easier configuration. Just select the columns, the number of values that should be present in...

Overview of visualizations


The various options to visualize data in KNIME allow you to get an overview or even publication-quality figures from the data you have preprocessed and analyzed.

The interactive versions of a node allow you to change the column selections and possibly other extra options.

The JFreeChart nodes generate images from the input data; these are also available as a view with further customization options. These nodes usually do not support the HiLite feature or the different visual properties (color, size, and shape).

First, to help you decide which node to use to view the data, we will compare the capabilities of the different visualization nodes:

Visual guide for the views


In this section, we will introduce the iris dataset (Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science. Iris dataset: http://archive.ics.uci.edu/ml/datasets/Iris) with some screenshots from the views (without their controls).

Box plot for the numeric columns

The Conditional Box Plot and the Box Plot nodes' views look similar. These are also sometimes called box-and-whisker diagrams. The Box Plot node visualizes the values of different columns, while the Conditional Box Plot view shows one column's values grouped by a nominal column's values. As you can see in the screenshot, the HiLite information is visible for the outliers (but only for those values). You can also select the outliers and HiLite them.

The shape of the outlier points is not influenced by the shape property.

Histogram with a few columns selected, HiLited rows and colored...

Distance matrix


The distance matrix is used not just for visualization, but for learning algorithms too. You can think of it as a column of collections, where each cell contains the distances between that row and the previous rows.
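The "column of collections" picture can be sketched in a few lines of Python. This is an illustration of the layout only, assuming Euclidean distance and made-up points; it does not reproduce KNIME's internal data structure.

```python
import math

def distance_column(points):
    """For row i, return the list of distances to rows 0 .. i-1."""
    return [
        [math.dist(points[i], points[j]) for j in range(i)]
        for i in range(len(points))
    ]

# Made-up 2D points, chosen so that the distances are easy to check by hand.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
column = distance_column(points)

for i, cell in enumerate(column):
    print(f"row {i}: {cell}")
```

The first row has an empty cell because there is no earlier row, and each later cell is one entry longer than the previous one. Storing only this lower triangle is enough because a distance is symmetric.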

The supported distance functions are the following:

  • Real distances

    • Euclidean
    • Manhattan
    • Cosine
  • Bitvector distances

    • Tanimoto
    • Dice
    • Bitvector cosine
  • Distance vector (assuming you already have a distance vector, you can transform it to a distance matrix when there are row order changes or filtering)

  • Molecule distances (from extensions)
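For orientation, the sketch below writes out the common textbook definitions of the real-valued and bitvector distances in Python. These are assumptions about the usual formulas, not a copy of KNIME's code, and KNIME's exact normalization may differ. Bitvectors are represented here as sets of the indices of the set bits, and zero vectors or empty sets are not handled.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def tanimoto_distance(a, b):
    """1 - |a AND b| / |a OR b| for sets of set-bit indices."""
    return 1.0 - len(a & b) / len(a | b)

def dice_distance(a, b):
    """1 - 2 * |a AND b| / (|a| + |b|)."""
    return 1.0 - 2 * len(a & b) / (len(a) + len(b))

def bitvector_cosine_distance(a, b):
    """1 - |a AND b| / sqrt(|a| * |b|)."""
    return 1.0 - len(a & b) / math.sqrt(len(a) * len(b))

p, q = (0.0, 0.0), (3.0, 4.0)
bits_a, bits_b = {0, 1, 2}, {1, 2, 3}
print(euclidean(p, q), manhattan(p, q))
print(tanimoto_distance(bits_a, bits_b), dice_distance(bits_a, bits_b))
```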

The distance matrix feature can be used together with the hierarchical clustering, which also provides a node to view it; this is the main reason we introduced them in this chapter.

You can generate distances using the Distance Matrix Calculate node: just select the function and the numeric columns, and set the name. (The chunk size is just for fine-tuning with larger tables.) You can also load that information with the Distance Matrix...

Using visual properties


One of KNIME's great features is that it allows you to set certain properties of the views in advance. You need not remember how you set them in one view and how they are set in another; you just have to connect the views to the same table. This is a big step towards reproducible experimental results and figures, with the ease of graphical configuration. Each property is applied to the rows based on column values, so changes in the column values will affect (remove) the property. Each kind of property is exclusive: a new node with the same kind of property replaces the original one. When you want to reuse the properties in another place of the workflow, you can use the appender nodes.

The three supported properties are: color, size, and shape.

Color

With the Color Manager node, you can set the color for different rows. The colors can be assigned either to a nominal or a numeric column.

In the case of the nominal columns, each value can have a different color. This...

KNIME views


You can export the view contents to either PNG or SVG files from the File | Export as menu. (The latter is only available when the KNIME SVG Support is installed.)

It is worth noting the other usual view controls. The File menu contains the Always on top and Close options, besides the previously discussed Export as menu. The first option allows you to compare multiple views easily by keeping them side by side while you still work with other windows.

The rest of the menus are related to HiLiting, which will be discussed soon.

The configuration of nodes usually includes an option for how many different values or how many rows should be used when you create the view. Because the views usually load all the data (or the specified amount) into memory to keep the content resizable, too many rows would require too much memory, while too many different values would make it hard to understand either the legends or the whole view in certain cases.

The mouse mode controls allow you to select...

Basic KNIME views


The main views of KNIME give you multiple options to explore data. These nodes do not provide options to generate images for further nodes, but they give quite a good overview of the data, and you can save the view contents to files using the File menu.

Some of the nodes come in two flavors: the interactive and the normal. With the interactive flavor, you can modify certain parameters of the view without reconfiguring (and re-executing) the node. The interactive versions are better suited for data exploration, but the normal ones make it easier to check certain things with new data.

The Box plots

The Box Plot node has no configuration, but gives robust statistics (minimum, smallest, lower quartile, median, upper quartile, largest, and maximum) for numeric columns. You might wonder about the difference between the minimum and the smallest values, or between the largest and maximum values. The smallest is the maximum of the minimal value and the lower whisker bound, conventionally the lower quartile minus 1.5 times the interquartile range. The largest is computed analogously.
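The difference between the minimum and the smallest value is easier to see with numbers. The following Python sketch assumes the conventional bound of 1.5 times the interquartile range beyond the quartiles, and it uses the "inclusive" quartile method of Python's `statistics` module; KNIME's own quartile method may differ, so treat the exact figures as illustrative. The data is made up.

```python
from statistics import quantiles

def box_plot_stats(values, whisker=1.5):
    """Robust statistics for one numeric column (assumed 1.5 * IQR bounds)."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # "Smallest" is the larger of the true minimum and the lower bound;
    # "largest" is the smaller of the true maximum and the upper bound.
    smallest = max(data[0], q1 - whisker * iqr)
    largest = min(data[-1], q3 + whisker * iqr)
    outliers = [v for v in data if v < smallest or v > largest]
    return {
        "minimum": data[0], "smallest": smallest, "lower quartile": q1,
        "median": median, "upper quartile": q3, "largest": largest,
        "maximum": data[-1], "outliers": outliers,
    }

# Made-up column with one extreme value.
stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
for name, value in stats.items():
    print(f"{name}: {value}")
```

In this example the minimum and the smallest value coincide, while the maximum (100) lies above the largest value, so it is reported as an outlier.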

The view gives...

JFreeChart


The JFreeChart nodes are not installed by default, but the extension is available from the standard KNIME update site under the name KNIME JFreeChart.

The common part of these nodes is that you have to specify the appearance of the result in advance, and the focus is not on the view, but on the resulting image port object.

In the General Plot Options Configuration tab, you can specify the type of the resulting image (PNG or SVG), the size, the title, colors, and the font size (relative to the standard font for each item printed).

You can use the port objects in the reports, but you can also use them to check certain properties if you iterate through a loop and convert the result with Image To Table.

It is important to note that the customizable JFreeChart View tab is only available in freshly executed nodes. The generated image can be visualized either using the view or the image output.

In the JFreeChart View tab, you can customize (from the context menu) almost every aspect of the...

Open Street Map


In the KNIME Labs Extensions (available from the main KNIME update site) you can install the KNIME Open Street Map Integration in order to visualize spatial data.

This extension contains two nodes, OSM Map View and OSM Map to Image. The first one is interactive: you can browse the map, check the data points (the tooltips can give details about them), and find the distribution of interesting points by HiLiting them. (HiLiting cannot be done using these nodes, but you can select an area "blindly" if you use a Scatter Plot with the longitude and latitude information.)

Both nodes require coordinates to be in the range of -90 to 90 for latitude and -180 to 180 for longitude if there is an input table (which is optional). The image node's configuration includes a map to select which area should be visible in the resulting image; the configuration for the coordinates is on the Map Marker tab.
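A quick range check before feeding a table to the map nodes can save a failed execution. The Python sketch below is a hypothetical pre-check, not part of the extension; the function name and the sample points (with approximate coordinates) are illustrative.

```python
def has_valid_coordinates(lat, lon):
    """True when latitude is in [-90, 90] and longitude is in [-180, 180]."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0

# (name, latitude, longitude); the second entry has an out-of-range longitude.
points = [
    ("Dublin", 53.35, -6.26),
    ("bad longitude", 10.0, 253.35),
    ("Budapest", 47.50, 19.04),
]

valid = [name for name, lat, lon in points if has_valid_coordinates(lat, lon)]
rejected = [name for name, lat, lon in points if not has_valid_coordinates(lat, lon)]
print("valid:", valid)
print("rejected:", rejected)
```

Note that such a check cannot detect swapped latitude and longitude columns when both values happen to fall within -90 to 90.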

In the OSM Map View, you can browse by holding the right mouse button down and moving...

3D Scatterplot


We are highlighting one view from the many third-party views because it is really neatly done, and you might not come across it at first if you do not work with chemical data.

In the Erl Wood Open Source Nodes extension (from the community update site), you can find a node called 2D/3D Scatterplot. It allows you to plot 3D data and still use KNIME's HiLite functionality and the color and size properties (but those can also be selected on demand). This is a very well designed and implemented view node. Its configuration is limited to column filtering and the number of rows/distinct values that should be on the screen.

This node does not support the automatic generation of a diagram. It's more focused towards exploration and not towards creating final figures.

It can also provide a regression fit line in 2D mode. It can be a good alternative to the normal Scatter Plot node too (unless you need the shape properties).

A right-click on the canvas gives information about the...

Other visualization nodes


There are many options to show data, and you really do not have to limit yourself to those that are bundled with KNIME. In the community contributions (http://tech.knime.org/community), there are many options available. We will cherry-pick some of the more general and interesting visualization nodes.

The R plot, Python plot, and Matlab plot

The R plot, Python plot, and Matlab plot are available from the corresponding scripting extensions (the KNIME R Scripting extension, KNIME Python Scripting extension, and KNIME Matlab Scripting extension) on the community nodes update site.

Using these nodes does not require experience in the corresponding programming languages. There are templates from which you can choose, and the parameters can be adjusted using KNIME controls. Obviously, you can create your own templates or fine-tune existing ones if you are not satisfied.

You need to have access to (possibly local) servers to connect to the extensions. (The Python Plot...

Tips for HiLiting


HiLiting gives great tools for various tasks: outlier detection, manual row selection, and visualization of a custom subset.

Using Interactive HiLite Collector

First, let's assume you want to label the different outlier categories. In the case of the iris dataset, the outlier categories would be high sepal length, high sepal width, high petal length, high petal width, and their lower counterparts. You can also select the outliers by the different classes (iris-setosa, iris-versicolor, and iris-virginica) for each column (in both extreme directions), which gives 3 × 4 × 2 = 24 possible options. Quite a lot, but you will need only four views to compute these (and only a single one if you do not want to split according to the classes).
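The number of outlier categories follows from simple counting, which this short Python sketch spells out. The label strings are only for illustration.

```python
from itertools import product

columns = ["sepal length", "sepal width", "petal length", "petal width"]
directions = ["high", "low"]
classes = ["iris-setosa", "iris-versicolor", "iris-virginica"]

# Without splitting by class: one category per column and direction.
no_class = [f"{d} {c}" for c, d in product(columns, directions)]

# Split by class: one category per class, column, and direction.
per_class = [f"{d} {c} ({k})" for k, c, d in product(classes, columns, directions)]

print(len(no_class), len(per_class))  # 8 24
```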

Let's see how this can be done. We will cover only the simpler (no-class) analysis.

Connect the Box Plot node to the data source. Also, connect the Interactive HiLite Collector node to it. Execute both the Box Plot node and the collector, then open both views.

There are...

Visualizing models


In the previous chapter, we created a workflow to generate a grid. That must have looked pointless at that time, but now, we will move a bit forward and show an application. The GenerateGridForLogisticRegression.zip file contains the workflow demonstrating this idea with the iris dataset.

In this workflow, we use a setup very similar to the Generate Grid workflow up to the preprocessing meta node, but in this case, we use the average of the minimum and maximum values instead of creating NaN values when we generate a grid with a single value in that dimension. (This will be important when we apply the model.) We also modified the grid parameters to be compatible with the iris dataset. In the lower region of the workflow, we load the iris dataset from http://archive.ics.uci.edu/ml/datasets/Iris, so we can create a logistic regression model with the Logistic Regression (Learner) node (it uses all numeric columns).
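The single-value case can be sketched in Python as follows. This is an analogy for the workflow's grid logic, not an export of it; the function names and the ranges are hypothetical. A naive step of (max - min) / (count - 1) divides by zero when the count is 1, which is presumably where the NaN values came from, so that case is handled separately with the midpoint.

```python
from itertools import product

def axis_values(lo, hi, count):
    """Evenly spaced values in [lo, hi]; the midpoint when count is 1."""
    if count == 1:
        return [(lo + hi) / 2]
    step = (hi - lo) / (count - 1)
    return [lo + i * step for i in range(count)]

def make_grid(ranges):
    """ranges: one (lo, hi, count) triple per dimension."""
    axes = [axis_values(lo, hi, count) for lo, hi, count in ranges]
    return list(product(*axes))

# Hypothetical ranges: three values in the first dimension, one in the second.
grid = make_grid([(4.0, 8.0, 3), (2.0, 4.5, 1)])
for point in grid:
    print(point)
```

Using the midpoint keeps every grid point a valid numeric input, so a model can be applied to the grid even when a dimension is held fixed.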

We would like to apply this model to both the data and the grid....

Summary


In this chapter, we introduced the main visualization nodes and the statistical techniques that can be used to explore your data. We built on the knowledge you gathered in the previous chapter, because data transformation is inevitable in a complex analysis. HiLiting was introduced previously, but with the use cases in this chapter, you might now have a better idea about when you should use it.

Node                      Supported data types              Remarks

Box Plot                  Numeric (multiple)                Provides robust stats
Conditional Box Plot      Nominal and numeric (multiple)    Also gives robust stats
Histogram                 Nominal or numeric and numeric
Histogram (interactive)   Nominal or numeric and numeric
Interactive Table         Any                               Similar to port view
Lift...