In this chapter, we are going to present one of the most intuitive ways to create a predictive model—using the concept of a tree. Tree-based models, often also known as decision tree models, are successfully used to handle both regression and classification type problems. We'll explore both scenarios in this chapter, and we'll be looking at a range of different algorithms that are effective in training these models. We will also learn about a number of useful properties that these models possess, such as their ability to handle missing data and the fact that they are highly interpretable.
A decision tree is a model with a very straightforward structure that allows us to make a prediction on an output variable, based on a series of rules arranged in a tree-like structure. The output variable that we can model can be categorical, allowing us to use a decision tree to handle classification problems. Equally, we can use decision trees to predict a numerical output, and in this way we'll also be able to tackle problems where the predictive task is a regression task.
Decision trees consist of a series of split points, often referred to as nodes. In order to make a prediction using a decision tree, we start at the top of the tree at a single node known as the root node. The root node is a decision or split point because it places a condition on the value of one of the input features, and based on the outcome of this test we know whether to continue down the left branch or the right branch of the tree. We repeat this process of choosing a branch at every node we encounter until we reach a leaf node, which holds the value that the tree predicts.
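To make this traversal concrete, here is a minimal sketch of how a prediction could be read off a hypothetical two-feature tree by hand; the split thresholds and class labels are invented purely for illustration and do not come from any data set in this chapter.

```r
# Hypothetical hand-coded decision tree for two numerical features (x1, x2).
# Each internal node tests one feature against a threshold; leaves return a class.
predict_toy_tree <- function(x1, x2) {
  if (x1 < 15) {                     # root node: split on x1
    if (x2 < 11) "a" else "b"        # left subtree: split on x2
  } else {
    "c"                              # right subtree is a single leaf
  }
}

predict_toy_tree(12, 10.5)   # root -> left branch -> left leaf, returns "a"
```

Every prediction is just a walk from the root to one leaf, evaluating one simple condition per node along the way.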
Now that we have understood how a decision tree works, we'll want to address the issue of how we can train one using some data. There are several algorithms that have been proposed to build decision trees, and in this section we will present a few of the most well-known. One thing we should bear in mind is that whatever tree-building algorithm we choose, we will have to answer four fundamental questions:
For every node (including the root node), how should we choose the input feature to split on and, given this feature, what should the value of the split point be? (One common criterion for making this choice is sketched after this list.)
How do we decide whether a node should become a leaf node or if we should make another split point?
How deep should our tree be allowed to become?
Once we arrive at a leaf node, what value should we predict?
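As a partial answer to the first of these questions, many classification tree algorithms score candidate splits with an impurity measure, such as the Gini index, and pick the feature and threshold that yield the purest child nodes. The sketch below illustrates that idea for a single feature; the function names are our own and this is a simplification, not the full logic of any particular tree-building algorithm.

```r
# Gini impurity of a vector of class labels
gini <- function(labels) {
  if (length(labels) == 0) return(0)
  p <- table(labels) / length(labels)
  1 - sum(p ^ 2)
}

# Weighted impurity after splitting on the rule `feature < threshold`
split_impurity <- function(feature, labels, threshold) {
  left  <- labels[feature <  threshold]
  right <- labels[feature >= threshold]
  (length(left) * gini(left) + length(right) * gini(right)) / length(labels)
}

# Choose the threshold for one feature that minimizes the weighted impurity
best_split <- function(feature, labels) {
  thresholds <- sort(unique(feature))
  impurities <- sapply(thresholds, function(t) split_impurity(feature, labels, t))
  thresholds[which.min(impurities)]
}
```

A tree-building algorithm would apply a computation like `best_split()` to every input feature at a node and keep the feature and threshold pair with the lowest resulting impurity; the remaining three questions govern when to stop splitting and what to output at the leaves.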
Our first example showcasing tree-based methods in R will operate on a synthetic data set that we have created. The data set can be generated using the commands in the companion R file for this chapter, available from the publisher. The data consists of 287 observations of two input features, `x1` and `x2`. The output variable is a categorical variable with three possible classes: `a`, `b`, and `c`. If we follow the commands in the code file, we will end up with a data frame in R, `mcdf`:
```
> head(mcdf, n = 5)
        x1       x2 class
1 18.58213 12.03106     a
2 22.09922 12.36358     a
3 11.78412 12.75122     a
4 23.41888 13.89088     a
5 16.37667 10.32308     a
```
This problem is actually very simple, for two reasons: the data set is very small, with only two features, and the classes happen to be quite well separated in the feature space, something that is rarely the case in practice. Nonetheless, our objective in this section is to demonstrate how a decision tree is built, and a small, easily visualized data set makes this process much easier to follow.
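As a preview of what training looks like in practice, the following sketch fits a classification tree to this data frame using the `rpart` package; other tree packages in R would work equally well, and the call assumes that `mcdf` has been generated as described above.

```r
library(rpart)

# Fit a classification tree predicting class from the two input features
mc_tree <- rpart(class ~ x1 + x2, data = mcdf, method = "class")

# Inspect the sequence of split points the algorithm has chosen
print(mc_tree)

# Predict the class of a new observation (the feature values here are invented)
predict(mc_tree, newdata = data.frame(x1 = 20, x2 = 12), type = "class")
```

Printing the fitted model shows exactly the structure described above: a root node, a series of split conditions on `x1` and `x2`, and leaf nodes labeled with the predicted class.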
In this section, we will study the problem of predicting whether a particular banknote is genuine or has been forged. The banknote authentication data set is hosted at https://archive.ics.uci.edu/ml/datasets/banknote+authentication. The creators of the data set took specimens of both genuine and forged banknotes and photographed them with an industrial camera. The resulting grayscale images were processed using a type of time-frequency transformation known as a wavelet transform. Three features are computed from this transform and, together with the entropy of the image, make up the four input features for this binary classification task.
| Column name | Type | Definition |
|---|---|---|
| | Numerical | Variance of the wavelet-transformed image |
| | Numerical | Skewness of the wavelet-transformed image |
| | Numerical | Kurtosis of the wavelet-transformed image |
| | Numerical | Entropy of the image |
| | Binary | Authenticity of the banknote (genuine or forged) |
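Before building a model we need the data in R. A minimal sketch of loading it directly from the UCI repository follows; the exact file URL, the absence of a header row, and the column names we assign are our own assumptions and should be checked against the companion code for this chapter.

```r
# Sketch: read the banknote authentication data from the UCI repository.
# The file location and the lack of a header row are assumptions; adjust the
# path if you have downloaded the file locally instead.
url <- paste0("https://archive.ics.uci.edu/ml/machine-learning-databases/",
              "00267/data_banknote_authentication.txt")
bnote <- read.csv(url, header = FALSE)

# Column names chosen by us for readability; they are not part of the raw file
names(bnote) <- c("waveletVar", "waveletSkew", "waveletCurt", "entropy", "class")

# Treat the output variable as a factor so that tree packages fit a classification tree
bnote$class <- factor(bnote$class)
str(bnote)
```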