Reducing Data Size

RapidMiner is a highly versatile tool that can make data work harder for you. This book will show you how to import, parse, and structure your data with remarkable speed and efficiency. It’s data mining made accessible.

(For more resources related to this topic, see here.)

Selecting attributes using models

Weighting by the PCA approach, mentioned previously, is an example where the combination of attributes within an example drives the generation of the principal components, and the correlation of an attribute with these generates the attribute's weight.

When building classifiers, it is logical to take this a stage further and use the potential model itself as the determinant of whether the addition or removal of an attribute makes for better predictions. RapidMiner provides a number of operators to facilitate this, and the following sections go into detail for one of these operators with the intention of showing how applicable the techniques are to other similar operations. The operator that will be explained in detail is Forward Selection. This is similar to a number of others in the Optimization group within the Attribute selection and Data transformation section of the RapidMiner GUI operator tree. These operators include Backward Elimination and a number of Optimize Selection operators. The techniques illustrated are transferrable to these other operators.

A process that uses Forward Selection is shown in the next screenshot:

The Retrieve operator (labeled 1) simply retrieves the sonar data from the local sample repository. This data has 208 examples and 60 regular attributes named attribute_1 to attribute_60. The label is named class and has two values, Rock and Mine.

Forward Selection operator (labeled 2) tests the performance of a model on The examples containing more and more attributes. The inner operators within this operator perform this testing.

The Log to Data operator (labeled 3) creates an example set from the log entries that were written inside the Forward selection operator. Example sets are easier to process and store in the repository.

The Guess Types operator (labeled 4) changes the types of attributes based on their The contents. This is simply a cosmetic step to change real numbers into integers to make plotting them look better.

Now, let's return to the Forward Selection operator, which starts by invoking its inner operators to check the model performance using each of the 60 regular attributes individually. This means it runs 60 times. The attribute that gives the best performance is then retained, and the process is repeated with two attributes using the remaining 59 attributes along with the best from the first run. The best pair of attributes is then retained, and the process is repeated with three attributes using each of the remaining 58. This is repeated until the stopping conditions are met. For illustrative purposes, the parameters shown in the following screenshot are chosen to allow it to continue for 60 iterations and use all the 60 attributes.

The inner operator to the Forward Selection operator is a simple cross validation with the number of folds set to three. Using cross validation ensures that the performance is an estimate of what the performance would be on unseen data. Some overfitting will inevitably occur, and it is likely that setting the number of validations to three will increase this. However, this process is for illustrative purposes and needs to run reasonably quickly, and a low cross-validation count facilitates this.

Inside the Validation operator itself, there are operators to generate a model, calculate performance, and log data. These are shown in the following screenshot:

The Naïve Bayes operator is a simple model that does not require a large runtime to complete. Within the Validation operator, it runs on different training partitions of the data. The Apply Model and Performance operators check the performance of the operator using test partitions. The Log operator outputs information each time it is called, and the following screenshot shows the details of what it logs.

Running the process gives the log output as shown in the following screenshot:

It is worth understanding this output because it gives a good overview of how the operators work and fit together in a process. For example, the attributes applyCountPerformance, applyCountValidation, and applyCountForwardSelection increment by one each time the respective operator is executed. The expected behavior is that applyCountPerformance will increment with each new row in the result, applyCountValidation will increment every three rows, which corresponds to the number of cross validation folds, and applyCountForwardSelection will remain at 1 throughout the process. Note that validationPerformance is missing for the first three rows. This is because the validation operator has not calculated a performance yet. The first occurrence of the logging operator is called validationPerformance; it is the average of innerPerformance within the validation operator. So, for example, the values for innerPerformance are 0.652, 0.514, and 0.580 for the first three rows; these values average out to 0.582, which is the value for validationPerformance in the fourth row. The featureNames attribute shows the attributes that were used to create the various performance measurements.

The results are plotted as a graph as shown:

This shows that as the number of attributes increases, the validation performance increases and reaches a maximum when the number of attributes is 23. From there, it steadily decreases as the number of attributes reaches 60.

The best performance is given by the attributes immediately before the maximum validationPerformance attribute value. In this case, the attributes are:

attribute_12, attribute_40, attribute_16, attribute_11, attribute_6, attribute_28, attribute_19, attribute_17, attribute_44, attribute_37, attribute_30, attribute_53, attribute_47, attribute_22, attribute_41, attribute_54, attribute_34, attribute_23, attribute_27, attribute_39, attribute_57, attribute_36, attribute_10.

The point is that the number of attributes has reduced and indeed the model accuracy has increased. In real-world situations with large datasets and a reduction in the attribute count, an increase in performance is very valuable.


This article has covered the important topic of reducing data size by both the removal of examples and attributes. This is important to speed up processing time, and in some cases can even improve classification accuracy. Generally though, classification accuracy reduces as data reduces.

Resources for Article:

Further resources on this subject:

Books to Consider

comments powered by Disqus