Chapter 8. Trees and Random Forests with Python

Clustering, discussed in the last chapter, is an unsupervised algorithm. It is now time to switch back to supervised algorithms. Classification is a class of problems that surfaces frequently in predictive modelling, in various forms; accordingly, a family of classification algorithms is used to deal with them.

A decision tree is a supervised classification algorithm that is used when the target variable is a discrete or categorical variable (having two or more classes) and the predictor variables are either categorical or numerical. A decision tree can be thought of as a set of if-then rules for a classification problem whose target variable is discrete or categorical. The if-then rules are represented as a tree.

A decision tree is used when the decision is based on multi-stage criteria and multiple variables. A decision tree is very effective as a decision-making tool because it has a pictorial output...

Introducing decision trees


A tree is a data structure that can be used to state decision rules because it can be drawn so as to illustrate those rules pictorially. A tree has three basic elements: nodes, branches, and leaves. Nodes are the points from which one or more branches emerge; a node from which no branch originates is a leaf. A typical tree looks as follows:

Fig. 8.1: A representation of a decision tree with its basic elements—node, branches, and leaves

A tree, specifically a decision tree, starts with a root node, proceeds to the decision nodes, and ultimately reaches the terminal nodes, where the final decisions are made. All nodes except the terminal nodes represent one variable each, and the branches represent the different categories (values) of that variable. A terminal node represents the final decision or value for that route.

A decision tree

To understand what decision trees look like and how to make sense of them, let us look at an example. Consider a situation...

Understanding the mathematics behind decision trees


The main goal of a decision tree algorithm is to identify the variable, and the split on it, that gives the most homogeneous distribution with respect to the target variable. A homogeneous distribution means that similar values of the target variable are grouped together so that a concrete decision can be made.

Homogeneity

In the preceding example, the first goal would be to find the parameter (out of the four: Terrain, Rainfall, Groundwater, and Fertilizers) that results in the most homogeneous distribution of the target variable within its categories.

Without any parameter, the count of harvest type looks as follows:

  Bumper    Moderate    Meagre
    4          9           7
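
As a quick aside (not from the book), homogeneity is often quantified with entropy: the lower the entropy, the more homogeneous the distribution. The following is a minimal Python sketch applied to the counts above:

import math

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given raw counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# Entropy of the harvest-type distribution before any split
print(entropy([4, 9, 7]))  # ~1.51 bits; 0 would mean a perfectly homogeneous node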

Let us calculate, for each parameter, how splitting on that parameter affects the homogeneity of the target variable:

Fig. 8.4: Splitting the predictor and the target variables into categories to see their effect on the homogeneity of the dataset

If one...

Implementing a decision tree with scikit-learn


Now that we are sufficiently aware of the mathematics behind decision trees, let us implement a simple decision tree using the methods in scikit-learn. The dataset we will use is the commonly available iris dataset, which has information about flower species along with their petal and sepal dimensions. The purpose of this exercise is to create a classifier that can classify a flower as belonging to a certain species based on its petal and sepal dimensions.

To do this, let's first import the dataset and have a look at it:

import pandas as pd

# Load the iris dataset from a local CSV (adjust the path to your machine)
data = pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/My Work/Chapter 7/iris.csv')
data.head()

The dataset looks as follows:

Fig. 8.7: The first few observations of the iris dataset

Sepal-length, Sepal-width, Petal-length, and Petal-width are the dimensions of the flower, while Species denotes the class the flower belongs to. There are actually three classes of...
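
The rest of the implementation is truncated in this excerpt, but a minimal sketch of fitting such a classifier with scikit-learn might look as follows. This uses scikit-learn's bundled copy of the iris data rather than the book's CSV so that it is self-contained, and the parameter choices are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the bundled iris data: X holds the four dimensions, y the species
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Grow a tree that splits on entropy (information gain), capped at depth 3
clf = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=42)
clf.fit(X_train, y_train)
print('Test accuracy:', clf.score(X_test, y_test))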

Understanding and implementing regression trees


An algorithm very similar to the decision tree is the regression tree. The difference between the two is that the target variable of a regression tree is a continuous numerical variable, unlike a decision tree, where the target variable is categorical.

Regression tree algorithm

Regression trees are particularly useful when the training dataset has multiple features that interact in complicated, non-linear ways. In such cases, simple linear regression, or even linear regression with some tweaks, is either infeasible or produces a model so complex that it is of little use. An alternative to non-linear regression is to partition the dataset into smaller nodes/local partitions where the interactions are more manageable. We keep partitioning until the non-linear interactions are non-existent or the observations in a partition/node are very similar to each other. This is called recursive partition...
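As a hedged illustration (not from the book), a regression tree can be fitted with scikit-learn's DecisionTreeRegressor; the synthetic data below is an assumption made to keep the sketch self-contained:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data: a noisy sine wave over a single feature
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

# Each leaf of the fitted tree predicts the mean target value of its partition
reg = DecisionTreeRegressor(max_depth=3)
reg.fit(X, y)
print(reg.predict([[2.5]]))  # prediction for a new observation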

Understanding and implementing random forests


Random forest is a predictive algorithm that falls under the ambit of ensemble learning algorithms. An ensemble learning algorithm combines various independent models (similar or different) to solve a particular prediction problem. The final result is calculated from the results of all the independent models, and is typically better than the result of any one of them.

There are two kinds of ensemble algorithms, as follows:

  • Averaging methods: Several similar independent models are created (in the case of decision trees, this can mean trees of different depths, or trees involving certain variables and not others, and so on) and the final prediction is the average of the predictions of all the models.

  • Boosting methods: Here the goal is to reduce the bias of the combined estimator by building it sequentially from base estimators; a powerful model is created from several weak models.

Random forest...
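
Again as a hedged sketch (not from the book), a random forest, an averaging ensemble of decision trees, can be fitted with scikit-learn as follows; the parameter values are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()

# 100 trees, each grown on a bootstrap sample with random feature subsets
rf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(rf, iris.data, iris.target, cv=5)
print('Mean cross-validated accuracy:', scores.mean())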

Summary


In this chapter on decision trees, we first examined the structure and meaning of a decision tree. This was followed by a discussion of the mathematics behind creating one. Apart from implementing a decision tree in Python, the chapter also discussed the mathematics of related algorithms such as regression trees and random forests. Here is a brief summary of the chapter:

  • A decision tree is a classification algorithm used when the target variable is categorical and the predictor variables are either categorical or continuous numerical variables.

  • Splitting a node into subnodes so that one gets a more homogeneous distribution (similar observations together) is the primary goal while making a tree.

  • There are various methods to decide which variable should be used to split a node. These methods include information gain, the Gini index, and maximum reduction in variance.

  • The method of building a regression tree is very similar to a decision tree. However, the target variable in the case of a regression...
