
You're reading from  The Applied Data Science Workshop - Second Edition

Product type: Book
Published in: Jul 2020
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781800202504
Edition: 2nd Edition
Author: Alex Galea

Alex Galea has been professionally practicing data analytics since graduating with a master's degree in physics from the University of Guelph, Canada. He developed a keen interest in Python while researching quantum gases as part of his graduate studies. Alex is currently doing web data analytics, where Python continues to play a key role in his work. He is a frequent blogger about data-centric projects that involve Python and Jupyter Notebooks.


4. Training Classification Models

Overview

In this chapter, you will learn about algorithms such as Support Vector Machines, Random Forests, and k-Nearest Neighbors classifiers. While training and comparing a variety of models, you'll learn about the concept of overfitting with the help of decision boundary charts. By the end of this chapter, you will be able to use scikit-learn to apply these algorithms in order to train models for a real-world classification problem.

Introduction

In the previous chapters, we walked through the steps that we need to take in a data science project before we can train a machine learning model. This included the planning phase, that is, identifying business problems, assessing data sources for suitability, and deciding on modeling approaches.

Having decided on a general modeling approach, we should be careful to avoid the common pitfalls of training ML models as we proceed. Firstly, remember that training data is very important. In fact, increasing the amount of training data can have a larger impact on scoring performance than the choice of model. One issue is that there may not be enough data available, which can make patterns difficult to find and cause models to perform poorly on testing data. Data quality also has a huge effect on model performance. Some possible issues include the following:

  • Non-representative training data (sampling bias)
  • Errors in the record sets (such as recorded...
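The effect of training-set size on test performance can be demonstrated directly. The sketch below is not from the book: it uses synthetic data from scikit-learn's make_classification and a Random Forest, and simply compares held-out accuracy when the model is fit on a small versus a larger slice of the training data.

```python
# Sketch: how training-set size can affect test performance (synthetic data).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

scores = {}
for n in (50, 1000):  # small vs. larger training set
    model = RandomForestClassifier(random_state=0)
    model.fit(X_train[:n], y_train[:n])
    scores[n] = model.score(X_test, y_test)  # accuracy on held-out data

print(scores)
```

With most random seeds, the model trained on more data scores noticeably higher, though the exact numbers depend on the data.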

Understanding Classification Algorithms

Recall the two types of supervised machine learning: regression and classification. In regression, we predict a numerical target variable; recall, for example, the linear and polynomial models from Chapter 2, Data Exploration with Jupyter. Here, we will focus on the other type of supervised machine learning: classification, where the goal is to predict the class of a record using the available metrics. In the simplest case, there are only two possible classes, which means we are doing binary classification. This is the case for the example problem in this chapter, where we will try to predict whether an employee is going to leave. If there are more than two class labels, then we are doing multi-class classification.
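A minimal binary-classification sketch in scikit-learn looks like this. The feature names follow the chapter's example problem (satisfaction level and last evaluation), but the handful of records below is made up purely for illustration:

```python
# Binary classification sketch: predict whether an employee leaves (class 1)
# or stays (class 0) from two features. Data is illustrative, not real.
import numpy as np
from sklearn.svm import SVC

# Each row: [satisfaction_level, last_evaluation]
X = np.array([[0.9, 0.8], [0.8, 0.9], [0.2, 0.5], [0.1, 0.4]])
y = np.array([0, 0, 1, 1])  # 1 = left the company, 0 = stayed

clf = SVC(kernel='linear')
clf.fit(X, y)
pred = clf.predict([[0.15, 0.45]])  # a low-satisfaction employee
print(pred)  # -> [1]
```

The same fit/predict pattern applies regardless of which classifier we choose, which is what makes comparing algorithms in scikit-learn so convenient.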

Although there is little difference between binary and multi-class classification when it comes to training models with scikit-learn, the algorithms can be notably different. In particular, multi-class classification...
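Because the scikit-learn API is uniform, training and comparing the three classifiers named in this chapter takes only a few lines. This is a hedged sketch on synthetic data, not the book's Human Resource Analytics dataset:

```python
# Sketch: training and comparing SVM, KNN, and Random Forest classifiers
# on synthetic data; the fit/score interface is identical for each.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    'SVM': SVC(),
    'KNN': KNeighborsClassifier(),
    'Random Forest': RandomForestClassifier(random_state=0),
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = model.score(X_test, y_test)  # test accuracy

print(results)
```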

Summary

In this chapter, we learned about the SVM, KNN, and Random Forest classification algorithms and applied them to our preprocessed Human Resource Analytics dataset to build predictive models. These models were trained to predict whether an employee will leave the company, given a set of employee metrics.

For the purposes of keeping things simple and focusing on the algorithms, we built models that depend on only two features, that is, the satisfaction level and last evaluation value. This two-dimensional feature space also allowed us to visualize the decision boundaries and identify what overfitting looks like.
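The decision boundary charts mentioned above are built by classifying every point of a grid over the two-feature plane. The following sketch (synthetic data, plotting step omitted) shows the mesh-and-predict pattern; in practice, the resulting grid of labels would be passed to something like matplotlib's contourf:

```python
# Sketch: computing a decision-boundary grid over a 2D feature space.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((100, 2))                     # two features in [0, 1]
y = (X[:, 0] + X[:, 1] > 1).astype(int)      # simple synthetic labels

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Build a mesh covering the feature space and classify each grid point
xx, yy = np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
print(Z.shape)  # (50, 50) grid of class labels, ready for contour plotting
```

A very jagged, fragmented boundary on such a chart is a visual symptom of overfitting, while an overly smooth one can indicate underfitting.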

In the next chapter, we will introduce two important topics in machine learning: k-fold cross validation and validation curves. In doing so, we'll discuss more advanced topics, such as parameter tuning and model selection. Then, to optimize our final model for the employee retention problem, we'll explore feature extraction with the dimensionality reduction...
