You're reading from Data Science for Marketing Analytics - Second Edition

Product type: Book
Published in: Sep 2021
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781800560475
Edition: 2nd Edition
Authors (3):

Mirza Rahim Baig

Mirza Rahim Baig is a Data Science and Artificial Intelligence leader with over 13 years of experience across e-commerce, healthcare, and marketing. He currently leads Product Analytics for Marketing Services at Zalando, Europe's largest online fashion platform. In addition, he serves as a Subject Matter Expert and faculty member for MS-level programs at prominent ed-tech platforms and institutes in India. He is also the lead author of two books, Data Science for Marketing Analytics and The Deep Learning Workshop, both published by Packt. He is recognized as a thought leader in his field and frequently participates as a guest speaker at various forums.

Gururajan Govindan

Gururajan Govindan is a data scientist, intrapreneur, and trainer with more than seven years of experience working across domains such as finance and insurance. He is also an author of The Data Analysis Workshop, a book focusing on data analytics. He is well known for his expertise in data-driven decision-making and machine learning with Python.

Vishwesh Ravi Shrimali

Vishwesh Ravi Shrimali graduated from BITS Pilani in 2018, where he studied mechanical engineering, and completed his Master's in Machine Learning and AI from LJMU in 2021. He has authored Machine Learning for OpenCV (2nd edition), The Computer Vision Workshop, and Data Science for Marketing Analytics (2nd edition), all published by Packt. When he is not writing blogs or working on projects, he likes to go on long walks or play his acoustic guitar.

8. Fine-Tuning Classification Algorithms

Overview

This chapter will help you optimize predictive analytics using some of the most common classification algorithms from the scikit-learn machine learning library, such as support vector machines, decision trees, and random forests. You will learn how to implement tree-based classification models, having previously used tree-based models for regression. Next, you will learn how to choose appropriate performance metrics for evaluating a classification model. Finally, you will put all these skills to use in solving a customer churn prediction problem, where you will optimize and evaluate the best classification algorithm for predicting whether a given customer will churn.

Introduction

Consider a scenario where you are the machine learning lead in a marketing analytics firm. Your firm has taken on a project from Amazon to predict whether or not a user will buy a product during festive season sale campaigns. You have been provided with anonymized data about customer activity on the Amazon website – the number of products purchased, their prices, their categories, and more. In such scenarios, where the target variable is a discrete value – for example, the customer will either buy the product or not – the problems are referred to as classification problems. A large number of classification algorithms are available to solve such problems, and choosing the right one is a crucial task. So, you will first explore the dataset to come up with some observations about it. Next, you will try out different classification algorithms and evaluate the performance metrics for each classification model to understand whether...

Support Vector Machines

When dealing with data that is linearly separable, the goal of the Support Vector Machine (SVM) learning algorithm is to find a boundary between the classes that minimizes misclassification errors. The problem, however, is that several such decision boundaries (B1, B2) can exist, as you can see in the following figure:

Figure 8.1: Multiple decision boundaries

As a result, the question arises as to which of the boundaries is better, and what "better" means here. The solution is to use the margin as the optimization objective. The margin can be described as the distance between the boundary and the two points (one from each class) lying closest to the boundary. Figure 8.2 gives a visual definition of the margin.

The objective of the SVM algorithm is to maximize the margin. You will go over the intuition behind maximizing the margin in the next section. For now, you need to understand that the objective of an SVM linear classifier is to increase the width...
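A linear SVM can be fitted in a few lines with scikit-learn. The following is a minimal sketch on a synthetic dataset (the data and parameter values here are purely illustrative, not from the chapter's case study); the `C` parameter controls how strictly the margin penalizes misclassifications:

```python
# Minimal sketch: fitting a linear SVM classifier with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, linearly separable-ish data for illustration only.
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Smaller C tolerates more margin violations (wider margin);
# larger C penalizes misclassifications more heavily.
svm_clf = SVC(kernel="linear", C=1.0)
svm_clf.fit(X_train, y_train)
accuracy = svm_clf.score(X_test, y_test)
```

The fitted model's `support_vectors_` attribute holds the training points that lie closest to the decision boundary and thus define the margin.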

Decision Trees

Decision trees are mostly used for classification tasks. They are a non-parametric supervised learning method, meaning that, unlike with SVM, where you had to specify the kernel type, C, gamma, and other parameters, there are no such parameters to be specified for decision trees. This also makes them quite easy to work with. Decision trees, as the name suggests, use a tree-based structure for making a decision (finding the target variable). Each "branch" of the decision tree is formed by following a rule, for example, "is some feature greater than some value? – yes or no." Decision trees can be used both as regressors and classifiers with minimal changes. The following are the advantages and disadvantages of using decision trees for classification:

Advantages

  • Decision trees are easy to understand and visualize.
  • They can handle both numeric and categorical data.
  • The requirement for data cleaning in the case of decision...
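Training a decision tree classifier follows the same fit/score pattern as the other scikit-learn estimators. A minimal sketch on the bundled Iris dataset (chosen here only for illustration; note that hyperparameters such as `max_depth` exist even though the defaults work out of the box):

```python
# Minimal sketch: training a decision tree classifier.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# max_depth limits how deep the tree can grow; an unrestricted
# tree can memorize (overfit) the training data.
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_clf.fit(X, y)
train_accuracy = tree_clf.score(X, y)
```

The fitted tree is easy to inspect: `sklearn.tree.plot_tree(tree_clf)` draws the learned rules, which is what makes decision trees so easy to understand and visualize.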

Random Forest

The decision tree algorithm that you saw earlier faces the problem of overfitting. Since you fit only one tree on the training data, there is a high chance that the tree will overfit the data without proper pruning. For example, referring to the Amazon sales case study that we discussed at the start of this chapter, if your model learns the inherent randomness in the data, it will try to use that randomness as a basis for future predictions. Consider a scenario where, out of 100 customers, 90 bought a beard wash, primarily because most of them were males with beards.

However, if your model concludes that this is not related to gender, then the next time someone logs in during the sale, it will recommend beard wash, even if that person is female. Unfortunately, such mistakes are very common and can really harm the business. This is why it is important to treat the overfitting of models. The random forest algorithm reduces variance/overfitting by averaging...
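A random forest trains many trees on bootstrap samples of the data and averages their predictions, which dampens the noise any single tree picks up. A minimal sketch with scikit-learn (synthetic data and parameter values are illustrative only):

```python
# Minimal sketch: random forest as an ensemble of decision trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# n_estimators trees are each fit on a bootstrap sample of the
# training data; their votes are combined, reducing variance
# compared to a single unpruned tree.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
test_accuracy = forest.score(X_test, y_test)
```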

Preprocessing Data for Machine Learning Models

Preprocessing data before training can improve the accuracy of a machine learning model to a large extent, which is why it is an important step before training any algorithm on a dataset. Preprocessing data consists of the following methods: standardization, scaling, and normalization. Let's look at these methods one by one.

Standardization

Most machine learning algorithms assume that all features are centered at zero and have variance of the same order. In the case of linear models such as logistic and linear regression, some of the parameters used in the objective function assume that all the features are centered around zero and have unit variance. If the values of one feature are much larger than those of the other features, that feature may dominate the objective function, and the estimator may fail to learn from the other features. In such cases, standardization can be used to rescale...
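In scikit-learn, standardization is done with `StandardScaler`, which subtracts each feature's mean and divides by its standard deviation. A minimal sketch with illustrative values on deliberately mismatched scales:

```python
# Minimal sketch: standardizing features with StandardScaler.
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (illustrative values).
X = np.array([[1.0, 1000.0],
              [2.0, 1500.0],
              [3.0, 2000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After standardization, each column has mean 0 and unit variance,
# so neither feature dominates a scale-sensitive objective function.
```

In practice you would call `fit_transform` on the training set only, and then `transform` on the test set, so that no test-set statistics leak into training.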

Model Evaluation

When you train your model, you usually split the data into training and testing datasets. This is to ensure that the model doesn't overfit. Overfitting refers to a phenomenon where a model performs very well on the training data but fails to give good results on testing data, or in other words, the model fails to generalize.

In scikit-learn, you have a function known as train_test_split that splits the data into training and testing sets randomly.
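A minimal sketch of `train_test_split` (the array values here are illustrative only):

```python
# Minimal sketch: random train/test split with scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features (illustrative)
y = np.array([0, 1] * 5)

# 30% of the rows are held out for testing; random_state makes
# the shuffle reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```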

When evaluating your model, you typically tune its parameters to improve accuracy on your test data. However, if you optimize your parameters using only the test set, there is a high chance of leaking information from the test set into your training process. To avoid this, you can split the data into three parts: training, validation, and testing sets. The disadvantage of this technique, however, is that it further reduces the size of your training dataset.

The solution is to use cross-validation...
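Cross-validation can be run in one call with scikit-learn's `cross_val_score`. A minimal sketch using 5-fold cross-validation on the bundled Iris dataset with a logistic regression model (both chosen here purely for illustration):

```python
# Minimal sketch: 5-fold cross-validation with cross_val_score.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# The data is split into 5 folds; the model is trained 5 times,
# each time holding out a different fold for validation.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
mean_score = scores.mean()
```

Because every sample serves in a validation fold exactly once, you get a performance estimate without permanently sacrificing a chunk of training data.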

Performance Metrics

In the case of classification algorithms, we use a confusion matrix, which summarizes the performance of the learning algorithm. It is a square matrix that counts the number of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) outcomes:

Figure 8.45: Confusion matrix

For the sake of simplicity, let's use 1 as the positive class and 0 as the negative class; then:

TP: The number of cases that were observed and predicted as 1.

FN: The number of cases that were observed as 1 but predicted as 0.

FP: The number of cases that were observed as 0 but predicted as 1.

TN: The number of cases that were observed as 0 and predicted as 0.
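These four counts can be read directly off scikit-learn's `confusion_matrix`. A minimal sketch with illustrative labels (rows are observed classes, columns are predicted classes, ordered 0 then 1):

```python
# Minimal sketch: computing TP/TN/FP/FN with confusion_matrix.
from sklearn.metrics import confusion_matrix

# Illustrative labels: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels [0, 1], ravel() unpacks the 2x2 matrix
# in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```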

Consider the same case study of predicting whether a product will be returned or not. In that case, the preceding metrics can be understood using the following table:

Figure 8.46: Understanding the metrics

Precision

Precision is the ability of a classifier to not label a sample that is...

Summary

In this chapter, you learned how to perform classification using some of the most commonly used algorithms. After discovering how tree-based models work, you were able to calculate information gain, Gini values, and entropy. You applied these concepts to train decision tree and random forest models on two datasets.

Later in the chapter, you explored why the preprocessing of data using techniques such as standardization is necessary. You implemented various fine-tuning techniques for optimizing a machine learning model. Next, you identified the right performance metrics for your classification problems and visualized performance summaries using a confusion matrix. You also explored other evaluation metrics including precision, recall, F1 score, ROC curve, and the area under the curve.

You implemented these techniques on case studies such as the telecom dataset and customer churn prediction and discovered how similar approaches can be followed in predicting whether a customer...

