Reducing Data Size

by Andrew Chisholm | November 2013 | Open Source

In this article by Andrew Chisholm, author of Exploring Data with RapidMiner, we will learn how to select attributes using models.


Selecting attributes using models

Weighting by the PCA approach, mentioned previously, is an example where the combination of attributes within an example drives the generation of the principal components, and the correlation of an attribute with these generates the attribute's weight.

When building classifiers, it is logical to take this a stage further and use the potential model itself to determine whether adding or removing an attribute makes for better predictions. RapidMiner provides a number of operators to facilitate this. The following sections describe one of them, Forward Selection, in detail, with the intention of showing how the techniques apply to other similar operators. Forward Selection is similar to a number of others in the Optimization group within the Attribute selection and Data transformation section of the RapidMiner GUI operator tree. These operators include Backward Elimination and a number of Optimize Selection operators. The techniques illustrated are transferable to these other operators.

A process that uses Forward Selection is shown in the next screenshot:

The Retrieve operator (labeled 1) simply retrieves the sonar data from the local sample repository. This data has 208 examples and 60 regular attributes named attribute_1 to attribute_60. The label is named class and has two values, Rock and Mine.

The Forward Selection operator (labeled 2) tests the performance of a model on examples containing more and more attributes. The inner operators within this operator perform this testing.

The Log to Data operator (labeled 3) creates an example set from the log entries that were written inside the Forward Selection operator. Example sets are easier to process and store in the repository.

The Guess Types operator (labeled 4) changes the types of attributes based on their contents. This is simply a cosmetic step to change real numbers into integers to make plotting them look better.
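The idea behind this cosmetic step can be sketched in a few lines of Python. This is only an illustration of the concept, not RapidMiner's implementation: the function name guess_types and the dictionary-of-columns table layout are assumptions made for the example.

```python
from typing import Dict, List, Union

Number = Union[int, float]


def guess_types(table: Dict[str, List[float]]) -> Dict[str, List[Number]]:
    """Convert a column to integers when every value in it is a whole number.

    Hypothetical helper: it only mimics the effect described in the text.
    """
    converted: Dict[str, List[Number]] = {}
    for name, values in table.items():
        if all(float(v).is_integer() for v in values):
            # Every value is whole, so the column can be shown as integers.
            converted[name] = [int(v) for v in values]
        else:
            # At least one fractional value: leave the column as real numbers.
            converted[name] = list(values)
    return converted


# A tiny table shaped like part of the log: counters arrive as real numbers.
log_table = {
    "applyCountPerformance": [1.0, 2.0, 3.0],
    "innerPerformance": [0.652, 0.514, 0.580],
}
cleaned = guess_types(log_table)
print(cleaned)
```

The counter column becomes 1, 2, 3 while the performance column keeps its fractional values.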

Now, let's return to the Forward Selection operator, which starts by invoking its inner operators to check the model performance using each of the 60 regular attributes individually. This means it runs 60 times. The attribute that gives the best performance is then retained, and the process is repeated with two attributes using the remaining 59 attributes along with the best from the first run. The best pair of attributes is then retained, and the process is repeated with three attributes using each of the remaining 58. This is repeated until the stopping conditions are met. For illustrative purposes, the parameters shown in the following screenshot are chosen to allow it to continue for 60 iterations and use all the 60 attributes.
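The greedy loop described above can be sketched in plain Python. This is a minimal sketch, not RapidMiner's code: forward_selection and toy_score are hypothetical names, and toy_score is a made-up stand-in for the cross-validated model performance that the inner operators would calculate. The two attributes it rewards were picked arbitrarily for the illustration.

```python
from typing import Callable, List, Sequence, Tuple


def forward_selection(
    attributes: Sequence[str],
    score: Callable[[List[str]], float],
    max_rounds: int,
) -> Tuple[List[Tuple[List[str], float]], int]:
    """Greedy forward selection.

    Returns the best subset found in each round with its score,
    plus the total number of times `score` was called.
    """
    selected: List[str] = []
    remaining = list(attributes)
    history: List[Tuple[List[str], float]] = []
    evaluations = 0

    for _ in range(min(max_rounds, len(attributes))):
        best_attr, best_score = None, float("-inf")
        # Try adding each remaining attribute to the current selection.
        for candidate in remaining:
            evaluations += 1
            current = score(selected + [candidate])
            if current > best_score:
                best_attr, best_score = candidate, current
        # Retain the attribute that gave the best score in this round.
        selected.append(best_attr)
        remaining.remove(best_attr)
        history.append((list(selected), best_score))

    return history, evaluations


def toy_score(subset: List[str]) -> float:
    """Made-up score: two arbitrary attributes help, every attribute costs a little."""
    useful = {"attribute_11": 0.10, "attribute_12": 0.15}
    gain = sum(useful.get(name, 0.0) for name in subset)
    penalty = 0.001 * len(subset)
    return 0.5 + gain - penalty


names = [f"attribute_{i}" for i in range(1, 61)]
history, evaluations = forward_selection(names, toy_score, max_rounds=60)
best_subset, best_score = max(history, key=lambda item: item[1])
print(evaluations)   # 60 + 59 + ... + 1 = 1830 candidate subsets scored
print(best_subset)
```

With 60 attributes and 60 rounds, the scoring step runs 60 + 59 + ... + 1 = 1830 times, which is why a fast model and a low fold count matter for a process like this.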

The inner operator to the Forward Selection operator is a simple cross validation with the number of folds set to three. Using cross validation ensures that the performance is an estimate of what the performance would be on unseen data. Some overfitting will inevitably occur, and it is likely that setting the number of validations to three will increase this. However, this process is for illustrative purposes and needs to run reasonably quickly, and a low cross-validation count facilitates this.
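To make the fold sizes concrete, the following sketch splits the 208 sonar examples into three folds. It is an assumption-laden illustration: k_fold_indices is a hypothetical helper, and it uses contiguous blocks of indices, whereas a real validation would normally sample the examples rather than take them in order.

```python
from typing import List, Tuple


def k_fold_indices(n_examples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_examples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_examples, k)
    folds: List[Tuple[List[int], List[int]]] = []
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


folds = k_fold_indices(208, 3)
fold_sizes = [(len(train), len(test)) for train, test in folds]
print(fold_sizes)
```

Each model is trained on roughly two thirds of the data and tested on the remaining third, and every example is used for testing exactly once.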

Inside the Validation operator itself, there are operators to generate a model, calculate performance, and log data. These are shown in the following screenshot:

The Naïve Bayes operator builds a simple model that does not require a large runtime to complete. Within the Validation operator, it runs on different training partitions of the data. The Apply Model and Performance operators check the performance of the model using test partitions. The Log operator outputs information each time it is called, and the following screenshot shows the details of what it logs.

Running the process gives the log output as shown in the following screenshot:

It is worth understanding this output because it gives a good overview of how the operators work and fit together in a process. For example, the attributes applyCountPerformance, applyCountValidation, and applyCountForwardSelection increment by one each time the respective operator is executed. The expected behavior is that applyCountPerformance will increment with each new row in the result, applyCountValidation will increment every three rows, which corresponds to the number of cross-validation folds, and applyCountForwardSelection will remain at 1 throughout the process. Note that validationPerformance is missing for the first three rows. This is because the Validation operator has not yet calculated a performance at that point. When validationPerformance first appears, it is the average of the innerPerformance values within the Validation operator. So, for example, the values for innerPerformance are 0.652, 0.514, and 0.580 for the first three rows; these values average out to 0.582, which is the value for validationPerformance in the fourth row. The featureNames attribute shows the attributes that were used to create the various performance measurements.
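The counter behavior and the averaging can be reproduced with a small simulation. This is a sketch of the logic as described in the text, not of RapidMiner internals: simulate_log is a hypothetical function, the first run uses the three innerPerformance values quoted above, and the second run uses made-up values purely to show the next three rows.

```python
from statistics import mean
from typing import Dict, List, Optional


def simulate_log(runs: List[List[float]]) -> List[Dict[str, object]]:
    """Build log rows for a list of validation runs (one list of fold scores each)."""
    rows: List[Dict[str, object]] = []
    apply_count_performance = 0
    last_validation_performance: Optional[float] = None

    for run_number, fold_scores in enumerate(runs, start=1):
        for inner in fold_scores:
            apply_count_performance += 1  # one new row per fold
            rows.append({
                "applyCountPerformance": apply_count_performance,
                "applyCountValidation": run_number,   # changes every len(fold_scores) rows
                "applyCountForwardSelection": 1,      # the outer operator runs once
                "innerPerformance": inner,
                # Only available once a validation run has finished.
                "validationPerformance": last_validation_performance,
            })
        last_validation_performance = round(mean(fold_scores), 3)
    return rows


first_run = [0.652, 0.514, 0.580]  # values quoted in the text
second_run = [0.6, 0.5, 0.7]       # made-up values, for illustration only
rows = simulate_log([first_run, second_run])
for row in rows:
    print(row)
```

The first three rows have no validationPerformance, and the fourth row carries 0.582, the average of the first three innerPerformance values.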

The results are plotted as a graph as shown:

This shows that as the number of attributes increases, the validation performance increases and reaches a maximum when the number of attributes is 23. From there, it steadily decreases as the number of attributes reaches 60.

The best performance is given by the attributes immediately before the maximum validationPerformance attribute value. In this case, the attributes are:

attribute_12, attribute_40, attribute_16, attribute_11, attribute_6, attribute_28, attribute_19, attribute_17, attribute_44, attribute_37, attribute_30, attribute_53, attribute_47, attribute_22, attribute_41, attribute_54, attribute_34, attribute_23, attribute_27, attribute_39, attribute_57, attribute_36, attribute_10.

The point is that the number of attributes has been reduced and the model accuracy has increased. In real-world situations with large datasets, a reduction in the attribute count combined with an increase in performance is very valuable.


This article has covered the important topic of reducing data size by both the removal of examples and attributes. This is important to speed up processing time, and in some cases can even improve classification accuracy. Generally though, classification accuracy reduces as data reduces.

About the Author :

Andrew Chisholm

Andrew Chisholm completed his degree in Physics from Oxford University nearly thirty years ago. This coincided with the growth in software engineering and it led him to a career in the IT industry. For the last decade he has been very involved in mobile telecommunications, where he is currently a product manager for a market-leading test and monitoring solution used by many mobile operators worldwide.

Throughout his career, he has always maintained an active interest in all aspects of data. In particular, he has always enjoyed finding ways to extract value from data and presenting this in compelling ways to help others meet their objectives. Recently, he completed a Master's in Data Mining and Business Intelligence with first class honors. He is a certified RapidMiner expert and has been using this product to solve real problems for several years. He maintains a blog where he shares some miscellaneous helpful advice on how to get the best out of RapidMiner.

He approaches problems from a practical perspective and has a great deal of relevant hands-on experience with real data. This book draws this experience together in the context of exploring data—the first and most important step in a data mining

He has published conference papers relating to unsupervised clustering and cluster validity measures and contributed a chapter called Visualizing cluster validity measures to an upcoming book entitled RapidMiner: Use Cases and Business Analytics Applications, Chapman & Hall/CRC
