
You're reading from The Applied Artificial Intelligence Workshop

Product type: Book
Published in: Jul 2020
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781800205819
Edition: 1st Edition
Authors (3):

Anthony So

Anthony So is a renowned leader in data science. He has extensive experience in solving complex business problems using advanced analytics and AI in different industries, including financial services, media, and telecommunications. He is currently the chief data officer of one of the most innovative fintech start-ups. He is also the author of several best-selling books on data science, machine learning, and deep learning. He has won multiple prizes at several hackathon competitions, such as Unearthed, GovHack, and Pepper Money. Anthony holds two master's degrees, one in computer science and the other in data science and innovation.

William So

William So is a data scientist with both a strong academic background and extensive professional experience. He is currently the Head of Data Science at Douugh and also a lecturer for the Master of Data Science and Innovation at the University of Technology Sydney. During his career, he has covered the end-to-end spectrum of data analytics, from machine learning to business intelligence, helping stakeholders derive valuable insights and achieve results that benefit the business. William is a co-author of The Applied Artificial Intelligence Workshop, published by Packt.

Zsolt Nagy

Zsolt Nagy is an engineering manager in an ad tech company heavy on data science. After acquiring his MSc in inference on ontologies, he used AI mainly for analyzing online poker strategies to aid professional poker players in decision making. After the poker boom ended, he put extra effort into building a T-shaped profile in leadership and software engineering.


3. An Introduction to Classification

Overview

This chapter introduces you to classification. You will implement various techniques, such as k-nearest neighbors and SVMs. You will use the Euclidean and Manhattan distances to work with k-nearest neighbors. You will apply these concepts to solve intriguing problems, such as predicting whether a credit card applicant is at risk of defaulting and determining whether an employee will stay with a company for more than two years. By the end of this chapter, you will be confident enough to apply classification to a dataset and draw conclusions from the results.

Introduction

In the previous chapter, you were introduced to regression models and learned how to fit a linear regression model with single or multiple variables, as well as with a higher-degree polynomial.

Unlike regression models, which learn to predict continuous numerical values (which can take on an infinite number of values), the classification models introduced in this chapter are all about splitting data into separate groups, also called classes.

For instance, a model can be trained to analyze emails and predict whether they are spam or not. In this case, the data is categorized into two possible groups (or classes). This type of classification is also called binary classification, which we will see a few examples of in this chapter. However, if there are more than two groups (or classes), you will be working on a multi-class classification (you will come across some examples of this in Chapter 4, An Introduction to Decision Trees).

But what is a...

The Fundamentals of Classification

As stated earlier, the goal of any classification problem is to separate the data into relevant groups accurately using a training set. There are a lot of applications of such projects in different industries, such as education, where a model can predict whether a student will pass or fail an exam, or healthcare, where a model can assess the level of severity of a given disease for each patient.

A classifier is a model that determines the class (label) that any given data point belongs to. For instance, suppose you have one set of observations containing credit-worthy individuals and another containing individuals who are risky in terms of their credit repayment tendencies.

Let's call the first group P and the second one Q. Here is an example of such data:

Figure 3.1: Sample dataset

With this data, you will train a classification model that will be able to correctly classify a new observation...

Data Preprocessing

Before building a classifier, we need to format our data so that the relevant data is kept in the most suitable format for classification and any data we are not interested in is removed.

The following are common ways to achieve this:

  • Replacing or dropping values:

    For instance, if there are N/A (or NA) values in the dataset, we may be better off substituting these values with a numeric value we can handle. Recall from the previous chapter that NA stands for Not Available and that it represents a missing value. We may choose to ignore rows with NA values or replace them with an outlier value.

    Note

    An outlier value is a value such as -1,000,000 that clearly stands out from regular values in the dataset.

    The fillna() method of a DataFrame does this type of replacement. The replacement of NA values with an outlier looks as follows:

    df.fillna(-1000000, inplace=True)

    The fillna() method changes all NA values into numeric values.

    This numeric value...
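As a minimal, self-contained sketch of the replacement described above (the DataFrame here is a hypothetical example, not one from the book's datasets), fillna() can be applied as follows:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset containing NA (missing) values
df = pd.DataFrame({
    "age": [25, np.nan, 40],
    "income": [50000, 62000, np.nan]
})

# Replace every NA value with an outlier that clearly stands out
# from the regular values in the dataset
df.fillna(-1000000, inplace=True)

print(df)
```

After this call, the DataFrame contains no missing values; every NA has been turned into the outlier value -1000000, which downstream models can detect or handle explicitly.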

The K-Nearest Neighbors Classifier

Now that we have our training and testing data, it is time to prepare our classifier to perform k-nearest neighbor classification. After being introduced to the k-nearest neighbor algorithm, we will use scikit-learn to perform classification.

Introducing the K-Nearest Neighbors Algorithm (KNN)

The goal of classification algorithms is to divide data so that we can determine which data points belong to which group.

Suppose that a set of classified points is given to us. Our task is to determine which class a new data point belongs to.

In order to train a k-nearest neighbor classifier (also referred to as KNN), we need to provide the corresponding class for each observation in the training set, that is, which group it belongs to. The goal of the algorithm is to find the relevant relationship or patterns between the features that will lead to this class. The k-nearest neighbors algorithm is based on a proximity measure that calculates the...
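To make the idea concrete, here is a minimal sketch using scikit-learn's KNeighborsClassifier on a small, made-up dataset (the data points and the choice of k=3 are illustrative assumptions, not values from the book's exercises):

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training set: two features per observation,
# with a known class label for each one
X_train = [[1, 2], [2, 3], [3, 1], [8, 9], [9, 8], [10, 10]]
y_train = [0, 0, 0, 1, 1, 1]

# k=3 neighbors; metric="euclidean" is the default,
# and metric="manhattan" is also supported
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)

# A new observation is assigned the most frequent class
# among its 3 nearest neighbors
print(knn.predict([[2, 2]]))  # near the first group -> class 0
print(knn.predict([[9, 9]]))  # near the second group -> class 1
```

Because the prediction is just a majority vote among the closest training points, changing the distance metric or the value of k can change which neighbors are consulted and, therefore, the predicted class.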

Classification with Support Vector Machines

We first used SVMs for regression in Chapter 2, An Introduction to Regression. In this topic, you will find out how to use SVMs for classification. As always, we will use scikit-learn to run our examples in practice.

What Are Support Vector Machine Classifiers?

The goal of an SVM is to find a surface in an n-dimensional space that separates the data points in that space into multiple classes.

In two dimensions, this surface is often a straight line. However, in three dimensions, the SVM often finds a plane. These surfaces are optimal in the sense that, based on the information available to the model, they maximize the separation between the classes in the n-dimensional space.

The optimal separator found by the SVM is called the best separating hyperplane.

An SVM is used to find one surface that separates two sets of data points. In other words, SVMs are binary classifiers. This does not mean that SVMs can only be used for binary...
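As a brief sketch of SVM classification with scikit-learn (the two-dimensional points below are invented for illustration; in 2D, the separating hyperplane found by the linear kernel is a straight line):

```python
from sklearn.svm import SVC

# Hypothetical two-class dataset in two dimensions
X_train = [[1, 1], [2, 1], [1, 2], [6, 6], [7, 5], [6, 7]]
y_train = [0, 0, 0, 1, 1, 1]

# A linear kernel searches for the best separating hyperplane
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

# New points are classified by which side of the hyperplane they fall on
print(clf.predict([[1.5, 1.5], [6.5, 6.0]]))
```

Here, the first point falls on the side of the hyperplane belonging to class 0 and the second on the side belonging to class 1.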

Summary

In this chapter, we learned about the basics of classification and the difference between classification and regression problems. Classification is about predicting a response variable with a limited set of possible values. As with any data science project, data scientists need to prepare the data before training a model. In this chapter, we learned how to standardize numerical values and replace missing values. Then, you were introduced to the famous k-nearest neighbors algorithm and discovered how it uses distance metrics to find the closest neighbors to a data point and then assigns the most frequent class among them. We also learned how to apply an SVM to a classification problem and tune some of its hyperparameters to improve the performance of the model and reduce overfitting.

In the next chapter, we will walk you through a different type of algorithm, called decision trees.

