Reader small image

You're reading from  The Applied Artificial Intelligence Workshop

Product typeBook
Published inJul 2020
Reading LevelIntermediate
PublisherPackt
ISBN-139781800205819
Edition1st Edition
Languages
Tools
Right arrow
Authors (3):
Anthony So
Anthony So
author image
Anthony So

Anthony So is a renowned leader in data science. He has extensive experience in solving complex business problems using advanced analytics and AI in different industries including financial services, media, and telecommunications. He is currently the chief data officer of one of the most innovative fintech start-ups. He is also the author of several best-selling books on data science, machine learning, and deep learning. He has won multiple prizes at several hackathon competitions, such as Unearthed, GovHack, and Pepper Money. Anthony holds two master's degrees, one in computer science and the other in data science and innovation.
Read more about Anthony So

William So
William So
author image
William So

William So is a Data Scientist with both a strong academic background and extensive professional experience. He is currently the Head of Data Science at Douugh and also a Lecturer for Master of Data Science and Innovation at the University of Technology Sydney. During his career, he successfully covered the end-end spectrum of data analytics from ML to Business Intelligence helping stakeholders derive valuable insights and achieve amazing results that benefits the business. William is a co-author of the "The Applied Artificial Intelligence Workshop" published by Packt.
Read more about William So

Zsolt Nagy
Zsolt Nagy
author image
Zsolt Nagy

Zsolt Nagy is an engineering manager in an ad tech company heavy on data science. After acquiring his MSc in inference on ontologies, he used AI mainly for analyzing online poker strategies to aid professional poker players in decision making. After the poker boom ended, he put extra effort into building a T-shaped profile in leadership and software engineering.
Read more about Zsolt Nagy

View More author details
Right arrow

4. An Introduction to Decision Trees

Overview

This chapter introduces you to two types of supervised learning algorithms in detail. The first algorithm will help you classify data points using decision trees, while the other algorithm will help you classify data points using random forests. Furthermore, you'll learn how to calculate the precision, recall, and F1 score of models, both manually and automatically. By the end of this chapter, you will be able to analyze the metrics that are used for evaluating the utility of a data model and classify data points based on decision trees and random forest algorithms.

Introduction

In the previous two chapters, we learned the difference between regression and classification problems, and we saw how to train some of the most famous algorithms. In this chapter, we will look at another type of algorithm: tree-based models.

Tree-based models are very popular as they can model complex non-linear patterns and they are relatively easy to interpret. In this chapter, we will introduce you to decision trees and the random forest algorithms, which are some of the most widely used tree-based models in the industry.

Decision Trees

A decision tree has leaves, branches, and nodes. Nodes are where a decision is made. A decision tree consists of rules that we use to formulate a decision (or prediction) on the prediction of a data point.

Every node of the decision tree represents a feature, while every edge coming out of an internal node represents a possible value or a possible interval of values of the tree. Each leaf of the tree represents a label value of the tree.

This may sound complicated, but let's look at an application of this.

Suppose we have a dataset with the following features and the response variable is determining whether a person is creditworthy or not:

Figure 4.1: Sample dataset to formulate the rules

A decision tree, remember, is just a group of rules. Looking at the dataset in Figure 4.1, we can come up with the following rules:

  • All people with house loans are determined as creditworthy.
  • If debtors are employed and studying, then...

The Confusion Matrix

Previously, we learned how to use some calculated metrics to assess the performance of a classifier. There is another very interesting tool that can help you evaluate the performance of a multi-class classification model: the confusion matrix.

A confusion matrix is a square matrix where the number of rows and columns equals the number of distinct label values (or classes). In the columns of the matrix, we place each test label value. In the rows of the matrix, we place each predicted label value.

A confusion matrix looks like this:

Figure 4.10: Sample confusion matrix

In the preceding example, the first row of the confusion matrix is showing us that the model is doing the following:

  • Correctly predicting class A 88 times
  • Predicting class A when the true value is B 3 times
  • Predicting class A when the true value is C 2 times

We can also see the scenario where the model is making a lot of mistakes when it is predicting...

Random Forest Classifier

If you think about the name random forest classifier, it can be explained as follows:

  • A forest consists of multiple trees.
  • These trees can be used for classification.
  • Since the only tree we have used so far for classification is a decision tree, it makes sense that the random forest is a forest of decision trees.
  • The random nature of the trees means that our decision trees are constructed in a randomized manner.

Therefore, we will base our decision tree construction on information gain or Gini Impurity.

Once you understand these basic concepts, you essentially know what a random forest classifier is all about. The more trees you have in the forest, the more accurate prediction is going to be. When performing prediction, each tree performs classification. We collect the results, and the class that gets the most votes wins.

Random forests can be used for regression as well as for classification. When using random forests...

Summary

In this chapter, we learned how to use decision trees for prediction. Using ensemble learning techniques, we created complex reinforcement learning models to predict the class of an arbitrary data point.

Decision trees proved to be very accurate on the surface, but they were prone to overfitting the model. Random forests and extremely randomized trees reduce overfitting by introducing some random elements and a voting algorithm, where the majority wins.

Beyond decision trees, random forests, and extremely randomized trees, we also learned about new methods for evaluating the utility of a model. After using the well-known accuracy score, we started using the precision, recall, and F1 score metrics to evaluate how well our classifier works. All of these values were derived from the confusion matrix.

In the next chapter, we will describe the clustering problem and compare and contrast two clustering algorithms.

lock icon
The rest of the chapter is locked
You have been reading a chapter from
The Applied Artificial Intelligence Workshop
Published in: Jul 2020Publisher: PacktISBN-13: 9781800205819
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Authors (3)

author image
Anthony So

Anthony So is a renowned leader in data science. He has extensive experience in solving complex business problems using advanced analytics and AI in different industries including financial services, media, and telecommunications. He is currently the chief data officer of one of the most innovative fintech start-ups. He is also the author of several best-selling books on data science, machine learning, and deep learning. He has won multiple prizes at several hackathon competitions, such as Unearthed, GovHack, and Pepper Money. Anthony holds two master's degrees, one in computer science and the other in data science and innovation.
Read more about Anthony So

author image
William So

William So is a Data Scientist with both a strong academic background and extensive professional experience. He is currently the Head of Data Science at Douugh and also a Lecturer for Master of Data Science and Innovation at the University of Technology Sydney. During his career, he successfully covered the end-end spectrum of data analytics from ML to Business Intelligence helping stakeholders derive valuable insights and achieve amazing results that benefits the business. William is a co-author of the "The Applied Artificial Intelligence Workshop" published by Packt.
Read more about William So

author image
Zsolt Nagy

Zsolt Nagy is an engineering manager in an ad tech company heavy on data science. After acquiring his MSc in inference on ontologies, he used AI mainly for analyzing online poker strategies to aid professional poker players in decision making. After the poker boom ended, he put extra effort into building a T-shaped profile in leadership and software engineering.
Read more about Zsolt Nagy