In this chapter, we will cover these recipes:
- Classifying data with a linear SVM
- Optimizing an SVM
- Multiclass classification with SVM
- Support vector regression
In this chapter, we will start by using a support vector machine (SVM) with a linear kernel to get a rough idea of how SVMs work. They create a hyperplane, or linear surface in several dimensions, which best separates the data.
In two dimensions, this is easy to see: the hyperplane is a line that separates the data. We will see the array of coefficients and intercept of the SVM. Together they uniquely describe a scikit-learn linear SVC predictor.
In the rest of the chapter, the SVMs have a radial basis function (RBF) kernel. They are nonlinear, but with smooth separating surfaces. In practice, SVMs work well with many datasets and thus are an integral part of the scikit-learn library.
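As a quick preview of those coefficients, here is a minimal sketch (using a tiny synthetic dataset, not the chapter's examples) of how a fitted linear SVC exposes the hyperplane through its coef_ and intercept_ attributes:

import numpy as np
from sklearn.svm import SVC

#two tiny, linearly separable point clouds (illustrative data only)
X = np.array([[0, 0], [1, 1], [1, 0], [3, 3], [4, 3], [3, 4]])
y = np.array([0, 0, 0, 1, 1, 1])

svc = SVC(kernel='linear')
svc.fit(X, y)

#the separating hyperplane is w . x + b = 0; coef_ holds w, intercept_ holds b
print(svc.coef_)       #array of coefficients, one row for the separating surface
print(svc.intercept_)  #intercept term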
In the first chapter, we saw some examples of classification with SVMs. We focused on SVMs' slightly superior classification performance compared to logistic regression, but for the most part, we left SVMs alone.
Here, we will focus on them more closely. While SVMs do not have an easy probabilistic interpretation, they do have an intuitive visual-geometric one: the main idea behind linear SVMs is to separate the two classes with the plane that maximizes the margin, the distance from the plane to the nearest training points of either class.
Let's linearly separate two classes with an SVM.
Let's start by loading the iris dataset available in scikit-learn. For this example we will continue with iris, but use two classes that are harder to tell apart: the Versicolour and Virginica iris species.
Load two classes and two features of the iris dataset:
#load the libraries we have been using
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
iris = datasets.load_iris()
X_w = iris.data[:, :2]  #load the first two features of the iris data
y_w = iris.target       #load the target of the iris data
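To focus on the two harder classes, one plausible continuation (a sketch, not necessarily the exact code used later in the recipe) is to keep only the Versicolour (1) and Virginica (2) samples and fit a linear SVC:

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

#keep only Versicolour (1) and Virginica (2); drop Setosa (0)
X_w = X_w[y_w != 0]
y_w = y_w[y_w != 0]

X_train, X_test, y_train, y_test = train_test_split(X_w, y_w, test_size=0.25, random_state=7)

svm_inst = SVC(kernel='linear')
svm_inst.fit(X_train, y_train)
print(svm_inst.score(X_test, y_test))        #accuracy on the held-out set
print(svm_inst.coef_, svm_inst.intercept_)   #the separating hyperplane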
Here we expand the previous recipe to classify all iris flower types based on two features. This is not a binary classification problem, but a multiclass classification problem. The SVC classifier (scikit's SVC) needs only a slight change to handle multiclass classification. For this, we will use all three classes of the iris dataset.
Load the first two features and all three classes of the iris dataset:
#load the libraries we have been using
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, :2]  #load the first two features of the iris data
y = iris.target       #load the target of the iris data
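With all three classes loaded, one common approach (a sketch using scikit-learn's standard multiclass tooling; SVC also handles multiclass natively via one-versus-one voting) is to wrap an RBF-kernel SVC in a one-versus-rest scheme:

from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

#stratify keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=7, stratify=y)

#one-vs-rest: trains one binary RBF-kernel SVC per class
svm_inst = OneVsRestClassifier(SVC(kernel='rbf', gamma='auto'))
svm_inst.fit(X_train, y_train)
print(svm_inst.score(X_test, y_test))  #accuracy on the held-out set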
We will capitalize on the SVM classification recipes by performing support vector regression on scikit-learn's diabetes dataset.
Load the diabetes dataset:
#load the libraries we have been using
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
Split the data into training and testing sets. Unlike in classification, there is no stratification for regression in this case:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
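From here, a minimal sketch of fitting a support vector regressor on the training set might look like this (the hyperparameter values are illustrative, not tuned):

from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error

svr = SVR(kernel='rbf', C=10.0, gamma='auto')  #illustrative hyperparameters
svr.fit(X_train, y_train)

y_pred = svr.predict(X_test)
print(mean_absolute_error(y_test, y_pred))  #average absolute error on held-out data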