This chapter describes how genetic algorithms can be used to improve the performance of supervised machine learning models by selecting the best subset of features from the provided input data. This chapter will start with a brief introduction to machine learning and then describe the two main types of supervised machine learning tasks – regression and classification. We will then discuss the potential benefits of feature selection when it comes to the performance of these models. Next, we will demonstrate how genetic algorithms can be utilized to pinpoint the genuine features that are generated by the Friedman-1 Test regression problem. Then, we will use the real-life Zoo dataset to create a classification model and improve its accuracy – again by applying genetic algorithms to isolate the best features for...
You're reading from Hands-On Genetic Algorithms with Python
Technical requirements
In this chapter, we will be using Python 3 with the following supporting libraries:
- deap
- numpy
- pandas
- matplotlib
- seaborn
- sklearn – introduced in this chapter
In addition, we will be using the UCI Zoo Dataset (https://archive.ics.uci.edu/ml/datasets/zoo).
The programs that will be used in this chapter can be found in this book's GitHub repository at https://github.com/PacktPublishing/Hands-On-Genetic-Algorithms-with-Python/tree/master/Chapter07.
Check out the following video to see the Code in Action:
http://bit.ly/37HCKyr
Supervised machine learning
The term machine learning typically refers to a computer program that receives inputs and produces outputs. Our goal is to train this program, also known as the model, to produce the correct outputs for the given inputs, without being explicitly programmed to do so.
During this training process, the model learns the mapping between the inputs and the outputs by adjusting its internal parameters. One common way to train the model is by providing it with a set of inputs, for which the correct output is known. For each of these inputs, we tell the model what the correct output is so that it can adjust, or tune itself, aiming to eventually produce the desired output for each of the given inputs. This tuning is at the heart of the learning process.
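As a toy illustration of this fit-and-tune cycle (the model and data here are ours, not the chapter's), scikit-learn models express it as a call to fit() with known input/output pairs, followed by predict() on new inputs:

```python
# Toy sketch of supervised training: show the model inputs with known
# correct outputs, let it tune its internal parameters, then query it.
from sklearn.tree import DecisionTreeClassifier

X_train = [[0, 0], [0, 1], [1, 0], [1, 1]]  # inputs (features)
y_train = [0, 1, 1, 0]                      # known correct outputs

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)           # the tuning/learning step
prediction = model.predict([[0, 1]])  # the trained model maps input to output
print(prediction)
```

Here, the model's "internal parameters" are the split decisions of the tree; other model types (linear models, neural networks) tune different parameters, but expose the same fit/predict interface.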
Over the years, many types of machine learning models have been developed. Each model has its own particular internal...
Feature selection in supervised learning
As we saw in the previous section, a supervised learning model receives a set of inputs, called features, and maps them to a set of outputs. The assumption is that the information described by the features is useful for determining the value of the corresponding outputs. At first glance, it may seem that the more information we can use as input, the better our chances of predicting the output(s) correctly. However, in many cases, the opposite holds true; if some of the features we use are irrelevant or redundant, the consequence could be a (sometimes significant) decrease in the accuracy of the models.
Feature selection is the process of selecting the most beneficial and essential set of features out of the entire given set of features. Besides increasing the accuracy of the model, a successful feature selection can provide the following...
Selecting the features for the Friedman-1 regression problem
The Friedman-1 regression problem, which was created by Friedman and Breiman, describes a single output value, y, which is a function of five input values, x0..x4, and randomly generated noise, according to the following formula:

y = 10 · sin(π · x0 · x1) + 20 · (x2 − 0.5)² + 10 · x3 + 5 · x4 + noise · N(0, 1)
The input variables, x0..x4, are independent and uniformly distributed over the interval [0, 1]. The last component of the formula is normally distributed, randomly generated noise, multiplied by the constant noise, which determines the noise level.
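The formula can be sanity-checked in code: with the noise level set to 0.0, the y values returned by scikit-learn's make_friedman1() should equal the deterministic part exactly (this assumes the standard Friedman-1 definition, y = 10·sin(π·x0·x1) + 20·(x2 − 0.5)² + 10·x3 + 5·x4):

```python
# Verify the Friedman-1 formula against sklearn's generator (noise=0.0
# removes the random term, so the match should be exact).
import numpy as np
from sklearn.datasets import make_friedman1

X, y = make_friedman1(n_samples=20, n_features=5, noise=0.0, random_state=0)
y_manual = (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
            + 20 * (X[:, 2] - 0.5) ** 2
            + 10 * X[:, 3]
            + 5 * X[:, 4])
print(np.allclose(y, y_manual))
```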
In Python, the scikit-learn (sklearn) library provides us with the make_friedman1() function, which can be used to generate a dataset containing the desired number of samples. Each of the samples consists of randomly generated x0..x4 values and their corresponding calculated y value. The interesting part, however, is that we can tell...
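For example (the parameter values here are illustrative), asking make_friedman1() for more than five features pads each sample with extra randomly generated columns that have no effect on y — exactly the kind of irrelevant features a genetic algorithm should learn to discard:

```python
from sklearn.datasets import make_friedman1

# 15 columns per sample: x0..x4 drive y, the other 10 are irrelevant
X, y = make_friedman1(n_samples=60, n_features=15, noise=1.0,
                      random_state=42)
print(X.shape)  # (60, 15)
print(y.shape)  # (60,)
```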
Selecting the features for the classification Zoo dataset
The UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php) maintains over 350 datasets as a service to the machine learning community. These datasets can be used for experimentation with various models and algorithms. A typical dataset contains a number of features (inputs) and the desired output, in the form of columns, accompanied by a description of their meaning.
In this section, we will use the UCI Zoo dataset (https://archive.ics.uci.edu/ml/datasets/zoo). This dataset describes 101 different animals using the following 18 features:
| No. | Feature Name | Data Type |
|-----|--------------|-----------|
| 1 | animal name | Unique for each instance |
| 2 | hair | Boolean |
| 3 | feathers | Boolean |
| 4 | eggs | Boolean |
| 5 | milk | Boolean |
| 6 | airborne | Boolean |
| 7 | aquatic | Boolean |
| 8 | predator | Boolean |
Summary
In this chapter, you were introduced to machine learning and the two main types of supervised machine learning tasks – regression and classification. Then, you were presented with the potential benefits of feature selection on the performance of the models carrying out these tasks. At the heart of this chapter were two demonstrations of how genetic algorithms can be utilized to enhance the performance of such models via feature selection. In the first case, we pinpointed the genuine features that were generated by the Friedman-1 Test regression problem, while, in the other case, we selected the most beneficial features of the Zoo classification dataset.
In the next chapter, we will look at another possible way of enhancing the performance of supervised machine learning models, namely hyperparameter tuning.
Further reading
For more information about the topics that were covered in this chapter, please refer to the following resources:
- Applied Supervised Learning with Python, Benjamin Johnston and Ishita Mathur, April 26, 2019
- Feature Engineering Made Easy, Sinan Ozdemir and Divya Susarla, January 22, 2018
- Feature selection for classification, M.Dash and H.Liu, 1997: https://doi.org/10.1016/S1088-467X(97)00008-5
- UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/index.php