Reader small image

You're reading from  Mastering Predictive Analytics with R

Product typeBook
Published inJun 2015
Reading LevelExpert
Publisher
ISBN-139781783982806
Edition1st Edition
Languages
Tools
Right arrow
Authors (2):
Rui Miguel Forte
Rui Miguel Forte
author image
Rui Miguel Forte

Why do you think this reviewer is suitable for this book? Mr. Rui Miguel Forte has authored a book for Packt titled “Mastering Predictive Analytics with R”. The book has received a 5 star rating. He has 3 years experience as a Data Scientist. He has knowledge of Scala, Python, R, PHP. • Has the reviewer published any articles or blogs on this or a similar tool/technology ? [Provide Links and References] A brief of Unsupervised learning has been covered in his book “Mastering Predictive Analytics with R” https://www.safaribooksonline.com/library/view/mastering-predictive-analytics/9781783982806/ https://www.linkedin.com/profile/view?id=AAkAAAC5YUIBYL7LyLCWZ6LsR0ENJxByC2jU9AU&authType=NAME_SEARCH&authToken=c1Pg&locale=en_US&trk=tyah&trkInfo=clickedVertical%3Amynetwork%2CclickedEntityId%3A12149058%2CauthType%3ANAME_SEARCH%2Cidx%3A1-1-1%2CtarId%3A1444032603690%2Ctas%3ARui%20Miguel%20Forte • Feedback on the Outline (in case outline has been shared with the reviewer) The author said the outline is good to go. • Did the reviewer share any concerns or questions regarding the reviewing process? (related to the schedule, commitment, or any additional comments) No
Read more about Rui Miguel Forte

Rui Miguel Forte
Rui Miguel Forte
author image
Rui Miguel Forte

Rui Miguel Forte is currently the chief data scientist at Workable. He was born and raised in Greece and studied in the UK. He is an experienced data scientist, having over 10 years of work experience in a diverse array of industries spanning mobile marketing, health informatics, education technology, and human resources technology. His projects have included predictive modeling of user behavior in mobile marketing promotions, speaker intent identification in an intelligent tutor, information extraction techniques for job applicant resumes and fraud detection for job scams. He currently teaches R, MongoDB, and other data science technologies to graduate students in the Business Analytics MSc program at the Athens University of Economics and Business. In addition, he has lectured in a number of seminars, specialization programs, and R schools for working data science professionals in Athens. His core programming knowledge is in R and Java, and he has extensive experience working with a variety of database technologies such as Oracle, PostgreSQL, MongoDB, and HBase. He holds a Master’s degree in Electrical and Electronic Engineering from Imperial College London and is currently researching machine learning applications in information extraction and natural language processing.
Read more about Rui Miguel Forte

View More author details
Right arrow

Chapter 6. Tree-based Methods

In this chapter, we are going to present one of the most intuitive ways to create a predictive model—using the concept of a tree. Tree-based models, often also known as decision tree models, are successfully used to handle both regression and classification type problems. We'll explore both scenarios in this chapter, and we'll be looking at a range of different algorithms that are effective in training these models. We will also learn about a number of useful properties that these models possess, such as their ability to handle missing data and the fact that they are highly interpretable.

The intuition for tree models


A decision tree is a model with a very straightforward structure that allows us to make a prediction on an output variable, based on a series of rules arranged in a tree-like structure. The output variable that we can model can be categorical, allowing us to use a decision tree to handle classification problems. Equally, we can use decision trees to predict a numerical output, and in this way we'll also be able to tackle problems where the predictive task is a regression task.

Decision trees consist of a series of split points, often referred to as nodes. In order to make a prediction using a decision tree, we start at the top of the tree at a single node known as the root node. The root node is a decision or split point, because it places a condition in terms of the value of one of the input features, and based on this decision we know whether to continue on with the left part of the tree or with the right part of the tree. We repeat this process of choosing...

Algorithms for training decision trees


Now that we have understood how a decision tree works, we'll want to address the issue of how we can train one using some data. There are several algorithms that have been proposed to build decision trees, and in this section we will present a few of the most well-known. One thing we should bear in mind is that whatever tree-building algorithm we choose, we will have to answer four fundamental questions:

  • For every node (including the root node), how should we choose the input feature to split on and, given this feature, what is the value of the split point?

  • How do we decide whether a node should become a leaf node or if we should make another split point?

  • How deep should our tree be allowed to become?

  • Once we arrive at a leaf node, what value should we predict?

Note

A great introduction to decision trees is Chapter 3 of Machine Learning, Tom Mitchell. This book was probably the first comprehensive introduction to machine learning and is well worth reading...

Predicting class membership on synthetic 2D data


Our first example showcasing tree-based methods in R will operate on a synthetic data set that we have created. The data set can be generated using commands in the companion R file for this chapter, available from the publisher. The data consists of 287 observations of two input features, x1 and x2.

The output variable is a categorical variable with three possible classes: a, b, and c. If we follow the commands in the code file, we will end up with a data frame in R, mcdf:

> head(mcdf, n = 5)
          x1       x2 class
1 18.58213 12.03106     a
2 22.09922 12.36358     a
3 11.78412 12.75122     a
4 23.41888 13.89088     a
5 16.37667 10.32308     a

This problem is actually very simple because on the one hand, we have a very small data set with only two features, and on the other because the classes happen to be quite well separated in the feature space, something that is very rare. Nonetheless, our objective in this section is to demonstrate...

Predicting the authenticity of banknotes


In this section, we will study the problem of predicting whether a particular banknote is genuine or whether it has been forged. The banknote authentication data set is hosted at https://archive.ics.uci.edu/ml/datasets/banknote+authentication. The creators of the data set have taken specimens of both genuine and forged banknotes and photographed them with an industrial camera. The resulting grayscale image was processed using a type of time-frequency transformation known as a wavelet transform. Three features of this transform are constructed, and along with the image entropy, they make up the four features in total for this binary classification task.

Predicting complex skill learning


In this section, we'll have a chance to explore data from an innovative and recent project known as SkillCraft. The interested reader can find out more about this project on the Web by going to http://skillcraft.ca/. The key premise behind the project is that by studying the performance of players in a real-time strategy (RTS) game that involves complex resource management and strategic decisions, we can study how humans learn complex skills and develop speed and competence in dynamic resource allocation scenarios. To achieve this, data has been collected from players playing the popular real-time strategy game, Starcraft 2, developed by Blizzard.

In this game, players compete against each other on one of many fixed maps and starting locations. Each player must choose a fictional race from three available choices and start with six worker units, which are used to collect one of two game resources. These resources are needed in order to build military and...

Summary


In this chapter, we learned how to build decision trees for regression and classification tasks. We saw that although the idea is simple, there are several decisions that we have to make in order to construct our tree model, such as what splitting criterion to use, as well as when and how to prune our final tree.

In each case, we considered a number of viable options and it turns out that there are several algorithms that are used to build decision tree models. Some of the best qualities of decision trees are the fact that they are typically easy to implement and very easy to interpret, while making no assumptions about the underlying model of the data. Decision trees have native options for performing feature selection and handling missing data, and are very capable of handling a wide range of feature types.

Having said that, we saw that from a computational perspective, finding a split for categorical variables is quite expensive due to the exponential growth of the number of possible...

lock icon
The rest of the chapter is locked
You have been reading a chapter from
Mastering Predictive Analytics with R
Published in: Jun 2015Publisher: ISBN-13: 9781783982806
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Authors (2)

author image
Rui Miguel Forte

Why do you think this reviewer is suitable for this book? Mr. Rui Miguel Forte has authored a book for Packt titled “Mastering Predictive Analytics with R”. The book has received a 5 star rating. He has 3 years experience as a Data Scientist. He has knowledge of Scala, Python, R, PHP. • Has the reviewer published any articles or blogs on this or a similar tool/technology ? [Provide Links and References] A brief of Unsupervised learning has been covered in his book “Mastering Predictive Analytics with R” https://www.safaribooksonline.com/library/view/mastering-predictive-analytics/9781783982806/ https://www.linkedin.com/profile/view?id=AAkAAAC5YUIBYL7LyLCWZ6LsR0ENJxByC2jU9AU&authType=NAME_SEARCH&authToken=c1Pg&locale=en_US&trk=tyah&trkInfo=clickedVertical%3Amynetwork%2CclickedEntityId%3A12149058%2CauthType%3ANAME_SEARCH%2Cidx%3A1-1-1%2CtarId%3A1444032603690%2Ctas%3ARui%20Miguel%20Forte • Feedback on the Outline (in case outline has been shared with the reviewer) The author said the outline is good to go. • Did the reviewer share any concerns or questions regarding the reviewing process? (related to the schedule, commitment, or any additional comments) No
Read more about Rui Miguel Forte

author image
Rui Miguel Forte

Rui Miguel Forte is currently the chief data scientist at Workable. He was born and raised in Greece and studied in the UK. He is an experienced data scientist, having over 10 years of work experience in a diverse array of industries spanning mobile marketing, health informatics, education technology, and human resources technology. His projects have included predictive modeling of user behavior in mobile marketing promotions, speaker intent identification in an intelligent tutor, information extraction techniques for job applicant resumes and fraud detection for job scams. He currently teaches R, MongoDB, and other data science technologies to graduate students in the Business Analytics MSc program at the Athens University of Economics and Business. In addition, he has lectured in a number of seminars, specialization programs, and R schools for working data science professionals in Athens. His core programming knowledge is in R and Java, and he has extensive experience working with a variety of database technologies such as Oracle, PostgreSQL, MongoDB, and HBase. He holds a Master’s degree in Electrical and Electronic Engineering from Imperial College London and is currently researching machine learning applications in information extraction and natural language processing.
Read more about Rui Miguel Forte

Column name

Type

Definition

waveletVar

Numerical

Variance of the wavelet-transformed image

waveletSkew

Numerical

Skewness of the wavelet-transformed image

waveletCurt

Numerical

Curtosis of the wavelet-transformed image

entropy

Numerical

Entropy of the image

class

Binary

Authenticity...