Data Science Projects with Python - Second Edition

Product type: Book
Published in: Jul 2021
Publisher: Packt
ISBN-13: 9781800564480
Pages: 432
Edition: 2nd Edition
Author: Stephen Klosterman

Table of Contents (9 Chapters)

Preface
1. Data Exploration and Cleaning
2. Introduction to Scikit-Learn and Model Evaluation
3. Details of Logistic Regression and Feature Exploration
4. The Bias-Variance Trade-Off
5. Decision Trees and Random Forests
6. Gradient Boosting, XGBoost, and SHAP Values
7. Test Set Analysis, Financial Insights, and Delivery to the Client
Appendix

5. Decision Trees and Random Forests

Overview

In this chapter, we'll shift our focus to another type of machine learning model that has taken data science by storm in recent years: tree-based models. After learning about decision trees individually, you'll see how models made up of many trees, called random forests, can mitigate the overfitting associated with individual trees. After reading this chapter, you will be able to train decision trees for machine learning purposes, visualize trained decision trees, and train random forests and visualize their results.

Introduction

In the last two chapters, we gained a thorough understanding of how logistic regression works, along with plenty of experience using the scikit-learn package in Python to create logistic regression models.

In this chapter, we will introduce a powerful type of predictive model that takes a completely different approach from the logistic regression model: decision trees. Decision trees and the models based on them are some of the most performant models available today for general machine learning applications. The concept of using a tree process to make decisions is simple, and therefore, decision tree models are easy to interpret. However, a common criticism of decision trees is that they overfit to the training data. In order to remedy this issue, researchers have developed ensemble methods, such as random forests, that combine many decision trees to work together and make better predictions than any individual tree could.

We will see that...

Decision Trees

Decision trees and the machine learning models that are based on them, in particular random forests and gradient boosted trees, are fundamentally different types of models from Generalized Linear Models (GLMs), such as logistic regression. GLMs are rooted in the theories of classical statistics, which have a long history. The mathematics behind linear regression was originally developed at the beginning of the 19th century by Legendre and Gauss. Because of this, the normal distribution is also known as the Gaussian distribution.

In contrast, while the idea of using a tree process to make decisions is relatively simple, the popularity of decision trees as mathematical models has come about more recently. The mathematical procedures that we currently use for formulating decision trees in the context of predictive modeling were published in the 1980s. The reason for this more recent development is that the methods used to grow decision trees rely on computational power...
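
To make this concrete, here is a minimal sketch of growing and visualizing a decision tree with scikit-learn's DecisionTreeClassifier and plot_tree. The synthetic dataset and the max_depth value are illustrative assumptions, standing in for the case study data and settings used in this book:

# A minimal sketch of growing and visualizing a decision tree with scikit-learn.
# The synthetic dataset here is an assumption, standing in for the case study data.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# Synthetic binary classification data
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Grow a shallow tree; max_depth limits how many levels of binary splits are allowed
dt = DecisionTreeClassifier(max_depth=3, random_state=42)
dt.fit(X, y)

# Visualize the learned splits
plt.figure(figsize=(12, 6))
plot_tree(dt, filled=True, feature_names=[f'feature_{i}' for i in range(10)])
plt.show()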

Random Forests: Ensembles of Decision Trees

As we saw in the previous exercise, decision trees are prone to overfitting. This is one of the principal criticisms of their usage, despite the fact that they are highly interpretable. However, we were able to limit this overfitting, to an extent, by restricting the maximum depth to which the tree could be grown.
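
The following sketch illustrates this point: as max_depth increases, performance on the training data keeps improving while cross-validated performance levels off or degrades, which is the signature of overfitting. The synthetic data, the depth values, and the choice of ROC AUC as the metric are illustrative assumptions rather than the book's exact setup:

# A sketch of how limiting max_depth controls overfitting: compare training ROC AUC
# with cross-validated ROC AUC as the tree is allowed to grow deeper.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

for depth in [2, 4, 8, 16, None]:
    dt = DecisionTreeClassifier(max_depth=depth, random_state=42)
    cv_auc = cross_val_score(dt, X, y, cv=4, scoring='roc_auc').mean()
    dt.fit(X, y)
    train_auc = roc_auc_score(y, dt.predict_proba(X)[:, 1])
    print(f'max_depth={depth}: train AUC={train_auc:.3f}, CV AUC={cv_auc:.3f}')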

Building on the concepts of decision trees, machine learning researchers have leveraged multiple trees as the basis for more complex procedures, resulting in some of the most powerful and widely used predictive models. In this chapter, we will focus on random forests of decision trees. Random forests are examples of what are called ensemble models, because they are formed by combining other, simpler models. By combining the predictions of many models, it is possible to improve upon the deficiencies of any given one of them. This is sometimes called combining many weak learners to make a strong learner.
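
As a rough sketch of what this looks like in scikit-learn, a RandomForestClassifier can be trained and evaluated in the same way as a single tree; the hyperparameter values below are illustrative assumptions, not tuned settings from the chapter:

# A minimal sketch of a random forest ensemble with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Many trees, each trained on a bootstrap sample of the data and considering
# a random subset of features at each split
rf = RandomForestClassifier(n_estimators=200, max_depth=6, random_state=42)
cv_auc = cross_val_score(rf, X, y, cv=4, scoring='roc_auc')
print(f'Random forest CV ROC AUC: {cv_auc.mean():.3f}')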

Once you understand...

Summary

In this chapter, we've learned how to use decision trees and the ensemble models called random forests that are made up of many decision trees. Using these conceptually simple models, we were able to make better predictions than we could with logistic regression, judging by the cross-validation ROC AUC score. This is often the case for many real-world problems. Decision trees are robust to many of the potential issues that can prevent logistic regression models from performing well, such as non-linear relationships between features and the response variable, and the presence of complicated interactions among features.
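
A comparison of this kind can be sketched as follows; the synthetic dataset stands in for the case study data, so the actual scores reported in the book will differ:

# A sketch of comparing logistic regression and a random forest by
# cross-validated ROC AUC on the same (synthetic) data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

models = [
    ('Logistic regression', make_pipeline(StandardScaler(), LogisticRegression())),
    ('Random forest', RandomForestClassifier(n_estimators=200, random_state=42)),
]

for name, model in models:
    auc = cross_val_score(model, X, y, cv=4, scoring='roc_auc')
    print(f'{name}: mean CV ROC AUC = {auc.mean():.3f}')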

Although a single decision tree is prone to overfitting, the random forest ensemble method has been shown to reduce this high-variance issue. Random forests are built by training many trees. The decreased variance of the ensemble of trees is achieved by increasing the bias of the individual trees in the forest, by only training them on a portion of the...
