Applied Supervised Learning with Python
1st Edition, published April 2019 by Packt
ISBN-13: 9781789954920
Reading level: Intermediate

Authors (2)

Benjamin Johnston

Benjamin Johnston is a senior data scientist for one of the world's leading data-driven MedTech companies and is involved in the development of innovative digital solutions throughout the entire product development pathway, from problem definition to solution research and development, through to final deployment. He is currently completing his Ph.D. in machine learning, specializing in image processing and deep convolutional neural networks. He has more than 10 years of experience in medical device design and development, working in a variety of technical roles, and holds first-class honors bachelor's degrees in both engineering and medical science from the University of Sydney, Australia.

Ishita Mathur

Ishita Mathur has worked as a data scientist for 2.5 years with product-based start-ups working with business concerns in various domains and formulating them as technical problems that can be solved using data and machine learning. Her current work at GO-JEK involves the end-to-end development of machine learning projects, by working as part of a product team on defining, prototyping, and implementing data science models within the product. She completed her master's degree in high-performance computing with data science at the University of Edinburgh, UK, and her bachelor's degree with honors in physics at St. Stephen's College, Delhi.

Chapter 5. Ensemble Modeling

Note

Learning Objectives

By the end of the chapter, you will be able to:

  • Explain the concepts of bias and variance and how they lead to underfitting and overfitting

  • Explain the concepts behind bootstrapping

  • Implement a bagging classifier using decision trees

  • Implement adaptive boosting and gradient boosting models

  • Implement a stacked ensemble using a number of classifiers

Note

This chapter covers bias and variance, underfitting and overfitting, and then introduces ensemble modeling.

Introduction


In the previous chapters, we discussed the two types of supervised learning problems: regression and classification. We looked at a number of algorithms for each type and delved into how those algorithms worked.

But there are times when these algorithms, no matter how complex they are, just don't seem to perform well on the data that we have. There could be a variety of reasons: perhaps the data is not good enough, perhaps there really is no trend where we are trying to find one, or perhaps the model itself is too complex.

Wait. What? How can a model that is too complex be a problem? Oh, but it can! If a model is too complex and there isn't enough data, the model can fit the data so well that it learns even the noise and outliers, which is never what we want.

Oftentimes, where a single complex algorithm can give us a result that is way off, aggregating the results from a group of models can give us a result that's closer to the actual truth. This is because there...
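
As a preview of this idea, here is a minimal sketch that aggregates three classifiers with scikit-learn's VotingClassifier; the synthetic dataset and the choice of base models are our own illustration, not the book's:

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic data, split into training and validation sets
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Three individually imperfect base models
models = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('dt', DecisionTreeClassifier(max_depth=3, random_state=0)),
    ('nb', GaussianNB()),
]
for name, model in models:
    print(name, model.fit(X_train, y_train).score(X_val, y_val))

# Aggregating their votes often scores better than any single model
ensemble = VotingClassifier(estimators=models, voting='hard')
print('ensemble', ensemble.fit(X_train, y_train).score(X_val, y_val))

With hard voting, the ensemble predicts the majority class among the three models, so an error made by one model can be outvoted by the other two.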

Overfitting and Underfitting


Let's say we fit a supervised learning algorithm to our data and subsequently use the model to perform a prediction on a hold-out validation set. The model's performance is judged by how well it generalizes, that is, by how accurate its predictions are for the data points in an independent validation dataset.

Sometimes we find that the model is not able to make accurate predictions and gives poor performance on the validation data. This poor performance can be the result of a model that is too simple to model the data appropriately, or a model that is too complex to generalize to the validation dataset. In the former case, the model has a high bias and results in underfitting, while in the latter case, the model has a high variance and results in overfitting.
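
The contrast can be seen in a few lines of code. The following is an illustrative sketch (synthetic data, not an exercise from the book) that fits decision trees of different depths and compares their training and validation accuracy:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y), split into a training
# set and a hold-out validation set
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for depth in (1, 5, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train={tree.score(X_train, y_train):.2f}, "
          f"validation={tree.score(X_val, y_val):.2f}")

A depth-1 tree typically scores poorly on both sets (high bias, underfitting), while an unlimited-depth tree scores perfectly on the training set but noticeably worse on the validation set (high variance, overfitting).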

Bias

The bias in the prediction of a machine learning model represents the difference between the predicted values and the true values. A model is said to have a high bias if the average predicted...

Bagging


The term bagging is derived from a technique called bootstrap aggregation. In order to implement a successful predictive model, it's important to know in what situation we could benefit from using bootstrapping methods to build ensemble models. In this section, we'll talk about a way to use bootstrap methods to create an ensemble model that minimizes variance and look at how we can build an ensemble of decision trees, that is, the Random Forest algorithm. But what is bootstrapping and how does it help us build robust ensemble models?

Bootstrapping

The bootstrap method refers to random sampling with replacement, that is, drawing multiple samples (each known as a resample) of randomly chosen data points from the dataset, where each data point has an equal probability of being selected and the resamples can overlap in the data points they contain:

Figure 5.5: Randomly choosing data points

From the previous diagram, we can see that each of...
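
To make the resampling concrete, here is a small illustrative sketch (not the chapter's exercise): it first draws bootstrap resamples by hand with NumPy, and then uses scikit-learn's BaggingClassifier, which does the same thing internally, training one decision tree per resample and aggregating their votes:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Each resample is the same size as the dataset and is drawn with
# replacement, so points can repeat within a resample and the same
# point can appear in several resamples
rng = np.random.default_rng(0)
data = np.arange(10)              # stand-in for a dataset of 10 points
for _ in range(3):
    print(sorted(rng.choice(data, size=len(data), replace=True)))

# A bagging classifier built from decision trees
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        random_state=0)
print(bag.fit(X_train, y_train).score(X_val, y_val))

Because each tree sees a slightly different view of the data, averaging their predictions reduces the variance of any single tree.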

Boosting


The second ensemble technique we'll be looking at is boosting, which involves incrementally training new models that focus on the data points misclassified by the previous model, and utilizes weighted averages to turn weak models (underfitting models with high bias) into stronger ones. Unlike bagging, where each base estimator can be trained independently of the others, the training of each base estimator in a boosted algorithm depends on the previous one.

Although boosting also uses the concept of bootstrapping, it is done differently from bagging: each sample of data is weighted, which means that some bootstrapped samples are used for training more often than others. When training each model, the algorithm keeps track of which features are most useful and which data samples have the most prediction error; these samples are given higher weights and are considered to require more iterations to train the model properly.
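
As a quick illustration (synthetic data, not the book's exercise), both boosting variants covered in this chapter are available in scikit-learn:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Adaptive boosting: each new weak learner (a decision stump by default)
# upweights the points the previous learners misclassified
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
print('AdaBoost:', ada.fit(X_train, y_train).score(X_val, y_val))

# Gradient boosting: each new learner fits the residual errors of the
# ensemble so far
gbc = GradientBoostingClassifier(n_estimators=100, random_state=0)
print('Gradient boosting:', gbc.fit(X_train, y_train).score(X_val, y_val))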

When predicting the output, the boosting ensemble takes...

Summary


In this chapter, we started off with a discussion of overfitting and underfitting and how they can affect a model's performance on unseen data. The chapter looked at ensemble modeling as a solution to these problems, went on to discuss the different ensemble methods that can be used, and showed how they can decrease the overall bias or variance encountered when making predictions.

We first discussed bagging algorithms and introduced the concept of bootstrapping. Then, we looked at Random Forest as a classic example of a bagged ensemble and solved exercises that involved building a bagging classifier and a Random Forest classifier on the previously seen Titanic dataset.

We then moved on to boosting algorithms, discussing how they successfully reduce bias in the system, and gained an understanding of how to implement adaptive boosting and gradient boosting. The last ensemble method we discussed was stacking, which, as we saw from the exercise, gave us the best accuracy score of all the ensemble...
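
For reference, a stacked ensemble along these lines can be sketched with scikit-learn's StackingClassifier (available from scikit-learn 0.22; the chapter's exercise may construct the stack differently, and the base models here are our own illustrative picks):

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# The base estimators' out-of-fold predictions become the input
# features for the final (meta) estimator
stack = StackingClassifier(
    estimators=[('dt', DecisionTreeClassifier(max_depth=3, random_state=0)),
                ('nb', GaussianNB())],
    final_estimator=LogisticRegression(max_iter=1000),
)
print(stack.fit(X_train, y_train).score(X_val, y_val))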
