Learning scikit-learn: Machine Learning in Python — Save 50%
Experience the benefits of machine learning techniques by applying them to real-world problems using Python and the open source scikit-learn library with this book and ebook
This article, by Raúl Garreta and Guillermo Moncecchi, the Authors of Learning scikit-learn: Machine Learning in Python, presents the main concepts behind Machine Learning while solving a simple classification problem.
(For more resources related to this topic, see here.)
To get a grip on the problem of machine learning in scikit-learn, we will start with a very simple machine learning problem: we will try to predict the Iris flower species using only two attributes: sepal width and sepal length. This is an instance of a classification problem, where we want to assign a label (a value taken from a discrete set) to an item according to its features.
Let's first build our training dataset—a subset of the original sample, represented by the two attributes we selected and their respective target values. After importing the dataset, we will randomly select about 75 percent of the instances, and reserve the remaining ones (the evaluation dataset) for evaluation purposes (we will see later why we should always do that):
>>> from sklearn.cross_validation import train_test_split >>> from sklearn import preprocessing >>> # Get dataset with only the first two attributes >>> X, y = X_iris[:, :2], y_iris >>> # Split the dataset into a training and a testing set >>> # Test set will be the 25% taken randomly >>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33) >>> print X_train.shape, y_train.shape (112, 2) (112,) >>> # Standardize the features >>> scaler = preprocessing.StandardScaler().fit(X_train) >>> X_train = scaler.transform(X_train) >>> X_test = scaler.transform(X_test)
The train_test_split function automatically builds the training and evaluation datasets, randomly selecting the samples. Why not just select the first 112 examples? This is because it could happen that the instance ordering within the sample could matter and that the first instances could be different to the last ones. In fact, if you look at the Iris datasets, the instances are ordered by their target class, and this implies that the proportion of 0 and 1 classes will be higher in the new training set, compared with that of the original dataset. We always want our training data to be a representative sample of the population they represent.
The last three lines of the previous code modify the training set in a process usually called feature scaling. For each feature, calculate the average, subtract the mean value from the feature value, and divide the result by their standard deviation. After scaling, each feature will have a zero average, with a standard deviation of one. This standardization of values (which does not change their distribution, as you could verify by plotting the X values before and after scaling) is a common requirement of machine learning methods, to avoid that features with large values may weight too much on the final results.
Now, let's take a look at how our training instances are distributed in the two-dimensional space generated by the learning feature. pyplot, from the matplotlib library, will help us with this:
>>> import matplotlib.pyplot as plt >>> colors = ['red', 'greenyellow', 'blue'] >>> for i in xrange(len(colors)): >>> xs = X_train[:, 0][y_train == i] >>> ys = X_train[:, 1][y_train == i] >>> plt.scatter(xs, ys, c=colors[i]) >>> plt.legend(iris.target_names) >>> plt.xlabel('Sepal length') >>> plt.ylabel('Sepal width')
The scatter function simply plots the first feature value (sepal width) for each instance versus its second feature value (sepal length) and uses the target class values to assign a different color for each class. This way, we can have a pretty good idea of how these attributes contribute to determine the target class. The following screenshot shows the resulting plot:
Looking at the preceding screenshot, we can see that the separation between the red dots (corresponding to the Iris setosa) and green and blue dots (corresponding to the two other Iris species) is quite clear, while separating green from blue dots seems a very difficult task, given the two features available. This is a very common scenario: one of the first questions we want to answer in a machine learning task is if the feature set we are using is actually useful for the task we are solving, or if we need to add new attributes or change our method.
Given the available data, let's, for a moment, redefine our learning task: suppose we aim, given an Iris flower instance, to predict if it is a setosa or not. We have converted our problem into a binary classification task (that is, we only have two possible target classes).
If we look at the picture, it seems that we could draw a straight line that correctly separates both the sets (perhaps with the exception of one or two dots, which could lie in the incorrect side of the line). This is exactly what our first classification method, linear classification models, tries to do: build a line (or, more generally, a hyperplane in the feature space) that best separates both the target classes, and use it as a decision boundary (that is, the class membership depends on what side of the hyperplane the instance is).
To implement linear classification, we will use the SGDClassifier from scikit-learn. SGD stands for Stochastic Gradient Descent, a very popular numerical procedure to find the local minimum of a function (in this case, the loss function, which measures how far every instance is from our boundary). The algorithm will learn the coefficients of the hyperplane by minimizing the loss function.
To use any method in scikit-learn, we must first create the corresponding classifier object, initialize its parameters, and train the model that better fits the training data. You will see while you advance that this procedure will be pretty much the same for what initially seemed very different tasks.
>>> from sklearn.linear_modelsklearn._model import SGDClassifier >>> clf = SGDClassifier() >>> clf.fit(X_train, y_train)</p></pre>
The SGDClassifier initialization function allows several parameters. For the moment, we will use the default values, but keep in mind that these parameters could be very important, especially when you face more real-world tasks, where the number of instances (or even the number of attributes) could be very large. The fit function is probably the most important one in scikit-learn. It receives the training data and the training classes, and builds the classifier. Every supervised learning method in scikit-learn implements this function.
What does the classifier look like in our linear model method? As we have already said, every future classification decision depends just on a hyperplane. That hyperplane is, then, our model. The coef_ attribute of the clf object (consider, for the moment, only the first row of the matrices), now has the coefficients of the linear boundary and the intercept_ attribute, the point of intersection of the line with the y axis. Let's print them:
>>> print clf.coef_ [[-28.53692691 15.05517618] [ -8.93789454 -8.13185613] [ 14.02830747 -12.80739966]] >>> print clf.intercept_ [-17.62477802 -2.35658325 -9.7570213 ]
Indeed in the real plane, with these three values, we can draw a line, represented by the following equation:
-17.62477802 - 28.53692691 * x1 + 15.05517618 * x2 = 0
Now, given x1 and x2 (our real-valued features), we just have to compute the value of the left-side of the equation: if its value is greater than zero, then the point is above the decision boundary (the red side), otherwise it will be beneath the line (the green or blue side). Our prediction algorithm will simply check this and predict the corresponding class for any new iris flower.
But, why does our coefficient matrix have three rows? Because we did not tell the method that we have changed our problem definition (how could we have done this?), and it is facing a three-class problem, not a binary decision problem. What, in this case, the classifier does is the same we did—it converts the problem into three binary classification problems in a one-versus-all setting (it proposes three lines that separate a class from the rest).
The following code draws the three decision boundaries and lets us know if they worked as expected:
>>> x_min, x_max = X_train[:, 0].min() - .5, X_train[:, 0].max() + .5 >>> y_min, y_max = X_train[:, 1].min() - .5, X_train[:, 1].max() + .5 >>> xs = np.arange(x_min, x_max, 0.5) >>> fig, axes = plt.subplots(1, 3) >>> fig.set_size_inches(10, 6) >>> for i in [0, 1, 2]: >>> axes[i].set_aspect('equal') >>> axes[i].set_title('Class '+ str(i) + ' versus the rest') >>> axes[i].set_xlabel('Sepal length') >>> axes[i].set_ylabel('Sepal width') >>> axes[i].set_xlim(x_min, x_max) >>> axes[i].set_ylim(y_min, y_max) >>> sca(axes[i]) >>> plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=plt.cm.prism) >>> ys = (-clf.intercept_[i] – Xs * clf.coef_[i, 0]) / clf.coef_[i, 1] >>> plt.plot(xs, ys, hold=True)
The first plot shows the model built for our original binary problem. It looks like the line separates quite well the Iris setosa from the rest. For the other two tasks, as we expected, there are several points that lie on the wrong side of the hyperplane.
Now, the end of the story: suppose that we have a new flower with a sepal width of 4.7 and a sepal length of 3.1, and we want to predict its class. We just have to apply our brand new classifier to it (after normalizing!). The predict method takes an array of instances (in this case, with just one element) and returns a list of predicted classes:
>>>print clf.predict(scaler.transform([[4.7, 3.1]])) 
If our classifier is right, this Iris flower is a setosa. Probably, you have noticed that we are predicting a class from the possible three classes but that linear models are essentially binary: something is missing. You are right. Our prediction procedure combines the result of the three binary classifiers and selects the class in which it is more confident. In this case, we will select the boundary line whose distance to the instance is longer. We can check that using the classifier decision_function method:
>>>print clf.decision_function(scaler.transform([[4.7, 3.1]])) [[ 19.73905808 8.13288449 -28.63499119]]
In this article we included a very simple example of classification, trying to show the main steps for learning.
Resources for Article:
- Python Testing: Installing the Robot Framework [Article]
- Inheritance in Python [Article]
- Python 3: Object-Oriented Design [Article]
|Experience the benefits of machine learning techniques by applying them to real-world problems using Python and the open source scikit-learn library with this book and ebook|
eBook Price: $17.99
Book Price: $29.99
About the Author :
Guillermo Moncecchi is a Natural Language Processing researcher at the Universidad de la República of Uruguay. He received a PhD in Informatics from the Universidad de la República, Uruguay and a Ph.D in Language Sciences from the Université Paris Ouest, France. He has participated in several international projects on NLP. He has almost 15 years of teaching experience on Automata Theory, Natural Language Processing, and Machine Learning.
He also works as Head Developer at the Montevideo Council and has lead the development of several public services for the council, particularly in the Geographical Information Systems area. He is one of the Montevideo Open Data movement leaders, promoting the publication and exploitation of the city's data.
Raúl Garreta is a Computer Engineer with much experience in the theory and application of Artificial Intelligence (AI), where he specialized in Machine Learning and Natural Language Processing (NLP).
He has an entrepreneur profile with much interest in the application of science, technology, and innovation to the Internet industry and startups. He has worked in many software companies, handling everything from video games to implantable medical devices.
In 2009, he co-founded Tryolabs with the objective to apply AI to the development of intelligent software products, where he performs as the CTO and Product Manager of the company. Besides the application of Machine Learning and NLP, Tryolabs' expertise lies in the Python programming language and has been catering to many clients in Silicon Valley. Raul has also worked in the development of the Python community in Uruguay, co-organizing local PyDay and PyCon conferences.
He is also an assistant professor at the Computer Science Institute of Universidad de la República in Uruguay since 2007, where he has been working on the courses of Machine Learning, NLP, as well as Automata Theory and Formal Languages. Besides this, he is finishing his Masters degree in Machine Learning and NLP. He is also very interested in the research and application of Robotics, Quantum Computing, and Cognitive Modeling. Not only is he a technology enthusiast and science fiction lover (geek) but also a big fan of arts, such as cinema, photography, and painting.