Classification and Regression with Trees

Tree-based algorithms are very popular for two reasons: they are interpretable, and they make strong predictions that have won many machine learning competitions on online platforms such as Kaggle. Furthermore, they have many use cases outside of machine learning for solving problems, both simple and complex.

Building a tree is an approach to decision-making used in almost all industries. Trees can be used to solve both classification and regression problems, and this versatility makes them a go-to solution in many settings!

This chapter is broadly divided into the following two sections:

  • Classification trees
  • Regression trees

Each section will cover the fundamental theory of the different types of tree-based algorithms, along with their implementation in scikit-learn. By the end of this chapter, you will have learned how to aggregate several...

Technical requirements

Classification trees

Classification trees are used to predict a category or class. This is similar to the classification algorithms that you have learned about previously in this book, such as the k-nearest neighbors algorithm or logistic regression.

Broadly speaking, there are three tree-based algorithms that are used to solve classification problems:

  • The decision tree classifier
  • The random forest classifier
  • The AdaBoost classifier

In this section, you will learn how each of these tree-based algorithms works in order to classify a row of data as a particular class or category.

The decision tree classifier

The decision tree is the simplest tree-based algorithm and serves as the foundation for the other two algorithms...
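As a preview of the implementation side, the following is a minimal sketch of fitting scikit-learn's DecisionTreeClassifier; the synthetic dataset and the max_depth value are illustrative assumptions rather than values taken from the book:

```python
# A minimal sketch of a decision tree classifier in scikit-learn.
# The synthetic dataset and hyperparameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Generate a small synthetic classification problem
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fit a shallow tree and evaluate its accuracy on the held-out data
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```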

Regression trees

You have learned how trees are used to classify an observation as belonging to a particular class or category. However, trees can also be used to predict numeric outcomes. In this section, you will learn about the three types of tree-based algorithms that you can implement in scikit-learn to predict numeric outcomes instead of classes:

  • The decision tree regressor
  • The random forest regressor
  • The gradient boosted tree

The decision tree regressor

When we have data that is non-linear in nature, a linear regression model might not be the best model to choose. In such situations, it makes sense to choose a model that can fully capture the non-linearity of such data...
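As a minimal sketch of this idea, a decision tree regressor can capture a simple non-linear relationship that a straight line would miss; the sine-shaped synthetic data and the max_depth value below are illustrative assumptions:

```python
# A minimal sketch of a decision tree regressor in scikit-learn.
# The non-linear synthetic dataset and hyperparameters are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Create a simple non-linear (sine-shaped) regression problem with noise
rng = np.random.RandomState(42)
X = np.sort(5 * rng.rand(200, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fit the tree and report the R^2 score on the held-out data
reg = DecisionTreeRegressor(max_depth=4, random_state=42)
reg.fit(X_train, y_train)
print(reg.score(X_test, y_test))
```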

Ensemble classifier

The concept of ensemble learning was explored earlier in this chapter, when we learned about random forests, AdaBoost, and gradient boosted trees. However, the concept can also be extended to classifiers other than trees.

If we had built logistic regression, random forest, and k-nearest neighbors classifiers, and we wanted to group them together and extract the final prediction through majority voting, we could do so by using a voting classifier.

This concept can be better understood with the aid of the following diagram:

Figure: Ensemble learning with a voting classifier to predict fraudulent transactions

When examining the preceding diagram, note the following:

  • The random forest classifier predicted that a particular transaction was fraudulent, while the other two classifiers predicted that the transaction was not fraudulent.
  • The voting classifier sees that two...
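The workflow described in the preceding diagram can be sketched in scikit-learn with the VotingClassifier; the synthetic dataset and the settings of the individual estimators below are illustrative assumptions:

```python
# A minimal sketch of majority ("hard") voting with scikit-learn's VotingClassifier.
# The synthetic dataset and estimator settings are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Combine three different classifiers; 'hard' voting returns the majority class
voting_clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('knn', KNeighborsClassifier(n_neighbors=5)),
    ],
    voting='hard',
)
voting_clf.fit(X_train, y_train)
print(voting_clf.score(X_test, y_test))
```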

Summary

While this chapter was rather long, you have entered the world of tree-based algorithms and left with a wide arsenal of tools that you can implement to solve both small- and large-scale problems. To summarize, you have learned the following (a quick code reference follows this list):

  • How to use decision trees for classification and regression
  • How to use random forests for classification and regression
  • How to use AdaBoost for classification
  • How to use gradient boosted trees for regression
  • How the voting classifier can be used to build a single model out of different models
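As that quick reference, the sketch below simply shows where each of these estimators lives in scikit-learn; the hyperparameter values are illustrative assumptions, not recommendations from the book:

```python
# A quick-reference sketch of the scikit-learn estimators covered in this chapter.
# The hyperparameter values are illustrative assumptions, not recommendations.
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingRegressor,
                              RandomForestClassifier, RandomForestRegressor)
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

models = {
    'decision_tree_clf': DecisionTreeClassifier(max_depth=3),
    'decision_tree_reg': DecisionTreeRegressor(max_depth=3),
    'random_forest_clf': RandomForestClassifier(n_estimators=100),
    'random_forest_reg': RandomForestRegressor(n_estimators=100),
    'adaboost_clf': AdaBoostClassifier(n_estimators=50),
    'gradient_boosted_reg': GradientBoostingRegressor(n_estimators=100),
}

# Every estimator shares the same fit/predict interface,
# so they can be swapped in and out of the same workflow.
```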

In the upcoming chapter, you will learn how to work with data that has no target variable or labels, and how to perform unsupervised machine learning to solve such problems!
