Discovering the important features
We will now introduce the OneR package to discover some of the important features of the dataset. The OneR package produces a single decision rule for each feature and then ranks the features by the accuracy of their rules. Accuracy is defined as the probability of classifying the outcome correctly, and it can be derived from a confusion (or error) matrix, which we have seen in previous chapters. The OneR package has some other nice features, such as the ability to bin integer variables optimally in order to yield the best predictor.
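To make the idea of one rule per feature concrete, here is a minimal base R sketch of the underlying logic. It is not the OneR package's own implementation: it skips binning and tie handling, and the function names (one_rule_accuracy, rank_features) and the toy data are invented purely for illustration.

```r
# Minimal sketch of the one-rule idea in base R (not the OneR package itself).
# A rule maps each value of a feature to the outcome seen most often with it.
one_rule_accuracy <- function(feature, outcome) {
  counts <- table(feature, outcome)       # contingency table: value x class
  correct <- sum(apply(counts, 1, max))   # rows the majority-class rule gets right
  correct / length(outcome)               # accuracy on the same data
}

# Score every predictor against the target and sort from best to worst.
rank_features <- function(data, target) {
  predictors <- setdiff(names(data), target)
  acc <- sapply(predictors, function(p) {
    one_rule_accuracy(data[[p]], data[[target]])
  })
  sort(acc, decreasing = TRUE)
}

# Hypothetical toy data, made up for this example only.
toy <- data.frame(
  outlook = c("sunny", "sunny", "rain", "rain", "overcast", "overcast"),
  windy   = c("yes", "no", "yes", "no", "yes", "no"),
  play    = c("no", "no", "yes", "yes", "yes", "yes")
)

ranking <- rank_features(toy, "play")
print(ranking)
```

On this toy data, the rule built from outlook classifies every row correctly, while the rule built from windy gets four of the six rows right, so outlook is ranked first.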
The OneR package does not run natively on Spark, so we first need to use the sample() function to take a 95% sample of the Spark dataframe and then move the result to a local R dataframe via the collect() function.
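The sample-then-collect step might look like the following sketch, assuming the SparkR API and a running Spark session. The names spark_df, sampled_df, and local_df are placeholders, and the built-in faithful dataset stands in for the chapter's actual data.

```r
library(SparkR)

# Assumes Spark is installed and reachable from this R session.
sparkR.session()

# Stand-in Spark dataframe; in the chapter this would be the real dataset.
spark_df <- createDataFrame(faithful)

# Take a 95% sample without replacement. Spark samples each row
# independently, so the returned fraction is approximate, not exact.
# The seed is optional and only makes the sample repeatable.
sampled_df <- sample(spark_df, withReplacement = FALSE, fraction = 0.95, seed = 123)

# Move the sampled rows to the driver as a local R data.frame.
local_df <- collect(sampled_df)

str(local_df)
```

Because collect() pulls every sampled row into the memory of the local R session, the fraction passed to sample() should be chosen so that the result fits comfortably on the driver.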
Although this Spark dataframe is small enough to perform the example without the sampling, it is important to know how to sample from a dataframe, since if you are using Spark as intended, your dataframes will...