Support Vector Machines

In this chapter, we will cover these recipes:

  • Classifying data with a linear SVM
  • Optimizing an SVM
  • Multiclass classification with SVM
  • Support vector regression

Introduction

In this chapter, we will start by using a support vector machine (SVM) with a linear kernel to get a rough idea of how SVMs work. They create a hyperplane, or linear surface in several dimensions, which best separates the data.

In two dimensions, this is easy to visualize: the hyperplane is simply a line that separates the data. We will also look at the SVM's array of coefficients and its intercept; together, they uniquely describe a scikit-learn linear SVC predictor.
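
As a quick illustration of that last point (this snippet is not from the book; the estimator choice and the two-class subset of iris are assumptions), the fitted coefficients and intercept reproduce the decision function directly:

#illustration only: recover a linear SVC's decision function
#from its coef_ and intercept_ attributes
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

iris = load_iris()
X, y = iris.data[:100, :2], iris.target[:100] #two classes, two features

clf = LinearSVC().fit(X, y)

#for a binary problem: decision_function(x) = coef_ . x + intercept_
manual = X @ clf.coef_[0] + clf.intercept_[0]
print(np.allclose(manual, clf.decision_function(X))) #True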

In the rest of the chapter, the SVMs have a radial basis function (RBF) kernel. They are nonlinear, but with smooth separating surfaces. In practice, SVMs work well with many datasets and thus are an integral part of the scikit-learn library.

Classifying data with a linear SVM

In the first chapter, we saw some examples of classification with SVMs. We focused on SVMs' slightly superior classification performance compared to logistic regression, but for the most part, we left SVMs alone.

Here, we will focus on them more closely. While SVMs do not have an easy probabilistic interpretation, they do have an easy visual-geometric one. The main idea behind linear SVMs is to separate two classes with the best possible hyperplane, that is, the one that leaves the widest margin between the classes.

Let's linearly separate two classes with an SVM.

Getting ready

Let us start by loading and visualizing the iris dataset available in scikit-learn:

...
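
The recipe's listing is truncated in this preview. As a rough sketch of what the loading and visualization could look like (the choice of the first two classes and the first two features here is an assumption, not the book's exact listing):

#load the libraries we have been using
import matplotlib.pyplot as plt

from sklearn import datasets

iris = datasets.load_iris()
X = iris.data[:100, :2] #first two features of the first two classes
y = iris.target[:100] #labels for setosa (0) and versicolor (1)

#visualize the two classes in the plane of the two features
plt.scatter(X[y == 0, 0], X[y == 0, 1], label=iris.target_names[0])
plt.scatter(X[y == 1, 0], X[y == 1, 1], label=iris.target_names[1])
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.legend()
plt.show()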

Optimizing an SVM

For this example, we will continue with the iris dataset, but we will use two classes that are harder to tell apart: the versicolor and virginica iris species.

In this section we will focus on the following:

  • Setting up a scikit-learn pipeline: A chain of transformations with a predictive model at the end
  • A grid search: A scan over several SVM hyperparameter settings to find the best-performing combination

Both are sketched in code after the data-loading step that follows.

Getting ready

Load two classes and two features of the iris dataset:

#load the libraries we have been using
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import datasets

iris = datasets.load_iris()
X_w = iris.data[:, :2] #load the first two features of the iris data
y_w = iris.target
...
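
From here, a minimal sketch of the pipeline and grid search described above might look as follows (the scaler, the parameter grid, and the filtering down to classes 1 and 2 are assumptions):

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

#keep only the two harder classes: versicolor (1) and virginica (2)
mask = y_w != 0
X_two, y_two = X_w[mask], y_w[mask]

#a pipeline: scale the features, then fit an SVM
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='rbf')),
])

#scan regularization strength and kernel width with cross-validation
param_grid = {
    'svc__C': [0.1, 1, 10],
    'svc__gamma': [0.01, 0.1, 1],
}
grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X_two, y_two)
print(grid.best_params_, grid.best_score_)

GridSearchCV refits the whole pipeline for every parameter combination and keeps the best cross-validated score, so the scaler is re-fit inside each fold rather than leaking information from the held-out folds.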

Multiclass classification with SVM

We now expand the previous recipe to classify all iris flower types based on two features. This is not a binary classification problem, but a multiclass one; the steps below build directly on the previous recipe.

Getting ready

The SVC classifier (scikit-learn's SVC) needs only slight changes to handle multiclass classification. For this, we will use all three classes of the iris dataset.

Load two features for each class:

#load the libraries we have been using
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import datasets

iris = datasets.load_iris()
X = iris.data[:, :2] #load the first two features of the iris data
y = iris.target #load the target, the three iris species
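
A minimal sketch of fitting a multiclass SVM on these three classes might then be (wrapping SVC in a one-vs-rest classifier is an assumption; a bare SVC also handles multiclass data, using a one-vs-one scheme internally):

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

#one-vs-rest: train one binary SVM per class against the other two
clf = OneVsRestClassifier(SVC(kernel='rbf'))
clf.fit(X, y)
print(clf.score(X, y)) #training accuracy over all three species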

Support vector regression

We will capitalize on the SVM classification recipes by performing support vector regression on scikit-learn's diabetes dataset.

Getting ready

Load the diabetes dataset:

#load the libraries we have been using
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import datasets

diabetes = datasets.load_diabetes()

X = diabetes.data
y = diabetes.target

Split the data into training and testing sets. Because regression targets are continuous, there is no stratification in this case:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
...
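
The rest of the recipe is not shown in this preview. A minimal sketch of the remaining fit-and-score step, assuming a plain RBF-kernel SVR with placeholder hyperparameters, might be:

from sklearn.svm import SVR

#fit a support vector regressor with the default RBF kernel
svr = SVR(C=10.0, epsilon=1.0)
svr.fit(X_train, y_train)

#evaluate on the held-out test set (R^2 score)
print(svr.score(X_test, y_test))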