Predicting Categories with Naive Bayes and SVMs

In this chapter, you will learn about two popular machine learning algorithms for classification: the Naive Bayes algorithm and the linear support vector machine. Naive Bayes is a probabilistic model that assigns each observation to its most probable class, while the linear support vector machine separates the classes with a linear decision boundary.

The chapter covers the following topics:

The theoretical concept behind the Naive Bayes algorithm, explained in mathematical terms
Implementing the Naive Bayes algorithm by using scikit-learn
How the linear support vector machine algorithm works under the hood
Graphically optimizing the hyperparameters of linear support vector machines

Technical requirements

You will need Python 3.6 or greater, pandas ≥ 0.23.4, scikit-learn ≥ 0.20.0, and Matplotlib ≥ 3.0.0 installed on your system.

The code files of this chapter can be found on GitHub:
https://github.com/PacktPublishing/Machine-Learning-with-scikit-learn-Quick-Start-Guide/blob/master/Chapter_04.ipynb

Check out the following video to see the code in action:

http://bit.ly/2COBMUj

The Naive Bayes algorithm

The Naive Bayes algorithm uses Bayes' theorem to assign observations to classes and categories. It is called naive because it assumes that all attributes are independent of one another, given the class. This assumption rarely holds exactly, as the attributes/features in a real dataset are usually related to one another in some way.

Despite this naive assumption, the algorithm often performs well in practice. The formula for Bayes' theorem is as follows:

Bayes theorem formula
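
For reference, Bayes' theorem in the p(h|D) notation used in this section is commonly written as follows, where h is the hypothesis and D is the data:

```latex
p(h \mid D) = \frac{p(D \mid h)\, p(h)}{p(D)}
```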

We can split the preceding formula into the following components:

p(h|D): This is the probability that a hypothesis, h, is true, given the data, D. An example of this would be the probability that a transaction is fraudulent, given a dataset that consists of fraudulent and non-fraudulent transactions.
p(D|h): This is the probability...
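
As a rough worked example of how these components combine, the following sketch applies Bayes' theorem to the fraud scenario. All of the probabilities are made-up values chosen purely for illustration, not figures from any dataset, and the variable names are hypothetical.

```python
# Bayes' theorem: p(h|D) = p(D|h) * p(h) / p(D)
# Every number below is an assumption made for illustration only.

p_fraud = 0.01               # p(h): prior probability that a transaction is fraudulent
p_flag_given_fraud = 0.90    # p(D|h): probability of the observed evidence if it is fraud
p_flag_given_legit = 0.05    # probability of the same evidence if it is not fraud

# p(D): total probability of the evidence, summed over both hypotheses
p_flag = p_flag_given_fraud * p_fraud + p_flag_given_legit * (1 - p_fraud)

# p(h|D): posterior probability of fraud, given the evidence
p_fraud_given_flag = p_flag_given_fraud * p_fraud / p_flag

print(f"p(fraud | evidence) = {p_fraud_given_flag:.4f}")
```

With these assumed numbers, the posterior comes out at roughly 0.15: even fairly strong evidence leaves the probability of fraud low when the prior is small.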

Support vector machines

In this section, you will learn about support vector machines (SVMs), or, to be more specific, linear support vector machines. In order to understand support vector machines, you will need to know what support vectors are. They are illustrated for you in the following diagram:

The concept of support vectors

In the preceding diagram, the following applies:

The linear support vector machine is a form of linear classifier. A linear decision boundary is constructed, and the observations on one side of the boundary (the circles) belong to one class, while the observations on the other side of the boundary (the squares) belong to another class.
The support vectors are the observations that have a triangle on them.
These are the observations that are either very close to the linear decision boundary or have been incorrectly classified.
We can define...
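
The idea of support vectors can be explored with a minimal sketch. It assumes scikit-learn's SVC with kernel="linear", chosen here only because it exposes the fitted support vectors through support_vectors_; it is not necessarily the estimator used elsewhere in this chapter. The synthetic dataset, the cluster centers, the value of C, and the variable names are all illustrative assumptions.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, well-separated two-class data (illustrative only).
X, y = make_blobs(
    n_samples=100,
    centers=[[-2.0, -2.0], [2.0, 2.0]],
    cluster_std=1.0,
    random_state=0,
)

# C is the inverse regularization strength; 1.0 is simply the library default.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# The linear decision boundary is the set of points where w . x + b = 0.
w = clf.coef_[0]
b = clf.intercept_[0]

# Observations that the fitted model kept as support vectors.
support_vectors = clf.support_vectors_
n_support = len(support_vectors)

# Signed decision values; the magnitude grows with distance from the boundary.
scores = clf.decision_function(X)
accuracy = clf.score(X, y)

print("weights:", w, "intercept:", b)
print("support vectors:", n_support, "of", len(X))
print("training accuracy:", accuracy)
```

Only a subset of the observations ends up as support vectors, which matches the description above: the boundary is determined by the points nearest to it or on the wrong side of it, not by the bulk of each class.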

Summary

This chapter introduced you to two fundamental supervised machine learning algorithms: the Naive Bayes algorithm and linear support vector machines. More specifically, you learned about the following topics:

How Bayes' theorem is used to produce the probability that a data point belongs to a particular class or category
Implementing the Naive Bayes classifier in scikit-learn
How linear support vector machines work under the hood
Implementing linear support vector machines in scikit-learn
Optimizing the inverse regularization strength, both graphically and by using the GridSearchCV algorithm
How to scale your data for a potential improvement in performance

In the next chapter, you will learn about the other type of supervised machine learning algorithm, one that predicts numeric values rather than classes and categories: linear regression...