Mastering Predictive Analytics with R


Chapter 7. Ensemble Methods

In this chapter, we take a step back from learning new models and instead think about how several trained models can work together as an ensemble, in order to produce a single model that is more powerful than the individual models involved.

The first type of ensemble that we will study uses different samples of the same data set in order to train multiple versions of the same model. These models then vote on the answer for a new observation: a majority decision is taken for classification problems and an average for regression problems. This process is known as bagging, which is short for bootstrap aggregation. Another approach to combining models is boosting. This involves training a chain of models and assigning higher weights to observations that were incorrectly classified, or that fell far from their predicted value, so that successive models are forced to prioritize them.

As methods, bagging and boosting are fairly general and have been applied with a number of different...

Bagging


The focus of this chapter is on combining the results from different models in order to produce a single model that will outperform individual models on their own. Bagging is essentially an intuitive procedure for combining multiple models trained on the same data set, using majority voting for classification models and the average prediction for regression models. We'll present the procedure for the classification case and later show how it is easily extended to handle regression models.
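To illustrate the two aggregation rules concretely, suppose we already have the predictions made by three models on the same observations; the matrices below are purely hypothetical:

    # Each row is an observation, each column one model's prediction
    class_preds <- rbind(c("a", "a", "b"),
                         c("b", "b", "b"))

    # Classification: the majority class across the models wins
    apply(class_preds, 1, function(row) names(which.max(table(row))))

    reg_preds <- rbind(c(2.1, 1.9, 2.3),
                       c(0.7, 1.1, 0.9))

    # Regression: average the predicted values across the models
    rowMeans(reg_preds)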

Boosting


Boosting offers an alternative take on the problem of how to combine models to achieve greater performance. It is especially suited to weak learners: models whose accuracy is better than random guessing, but not by much. One way to create a weak learner is to use a model whose complexity is configurable.

For example, we can train a multilayer perceptron network with a very small number of hidden layer neurons. Similarly, we can train a decision tree but only allow the tree to comprise a single node, resulting in a single split in the input data. This special type of decision tree is known as a stump.
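As a minimal sketch of both ideas (the choice of the nnet and rpart packages is an assumption here, made because they implement the two model types in question):

    library(nnet)    # feed-forward neural networks
    library(rpart)   # decision trees

    # A weak multilayer perceptron: just two hidden-layer neurons
    weak_mlp <- nnet(Species ~ ., data = iris, size = 2, trace = FALSE)

    # A decision stump: a tree restricted to a single split
    stump <- rpart(Species ~ ., data = iris,
                   control = rpart.control(maxdepth = 1))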

When we looked at bagging, the key idea was to take a set of random bootstrapped samples of the training data and then train multiple versions of the same model using these different samples. In the classical boosting scenario, there is no random component, as all the models use all of the training data.

For classification...
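To make the reweighting idea concrete, here is a minimal AdaBoost.M1-style sketch for a binary label coded as a factor with levels -1 and 1, using rpart stumps as the weak learners; the data frame df and all parameter values are hypothetical:

    library(rpart)

    M <- 50
    n <- nrow(df)
    w <- rep(1 / n, n)            # start with uniform observation weights
    y <- as.numeric(as.character(df$label))
    models <- vector("list", M)
    alpha  <- numeric(M)

    for (m in 1:M) {
      # Fit a stump to the data under the current weights
      models[[m]] <- rpart(label ~ ., data = df, weights = w,
                           control = rpart.control(maxdepth = 1))
      pred <- as.numeric(as.character(predict(models[[m]], df, type = "class")))
      err  <- sum(w * (pred != y)) / sum(w)    # weighted misclassification rate
      alpha[m] <- 0.5 * log((1 - err) / err)   # this model's vote weight
      w <- w * exp(-alpha[m] * y * pred)       # upweight misclassified points
      w <- w / sum(w)
    }

    # The ensemble classifies by the sign of the weighted vote over all M models
    votes <- sapply(1:M, function(m)
      alpha[m] * as.numeric(as.character(predict(models[[m]], df, type = "class"))))
    boosted_pred <- sign(rowSums(votes))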

Predicting atmospheric gamma ray radiation


In order to study boosting in action, in this section we'll introduce a new prediction problem from the field of atmospheric physics. More specifically, we will analyze the patterns made by radiation on a telescope camera in order to predict whether a particular pattern came from gamma rays leaking into the atmosphere, or from regular background radiation.

Gamma rays leave distinctive elliptical patterns and so we can create a set of features to describe these. The data set we will use is the MAGIC Gamma Telescope data set, hosted by the UCI Machine Learning repository at http://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope. Our data consists of 19,020 observations of the following attributes:

Column name    Type        Definition
FLENGTH        Numerical   The major axis of the ellipse (mm)
FWIDTH         Numerical   The minor axis of the ellipse (mm)
FSIZE          Numerical   Logarithm to the base ten of the sum of the content of all pixels in the camera photo
FCONC...
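A loading sketch for this data set; the file name magic04.data and the full attribute names follow the UCI documentation and should be verified against the repository page:

    # Attribute names as documented by UCI (the table above is abbreviated)
    cols <- c("FLENGTH", "FWIDTH", "FSIZE", "FCONC", "FCONC1",
              "FASYM", "FM3LONG", "FM3TRANS", "FALPHA", "FDIST", "CLASS")

    magic <- read.csv(paste0("http://archive.ics.uci.edu/ml/",
                             "machine-learning-databases/magic/magic04.data"),
                      header = FALSE, col.names = cols)

    magic$CLASS <- factor(magic$CLASS)   # g = gamma ray, h = hadron (background)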

Bagging procedure for binary classification

Inputs:

  • data: The input data frame containing the input features and a column with the binary output label

  • M: An integer, representing the number of models that we want to train

Output:

  • models: A set of M trained binary classifier models

Method:

1. Create a random sample of size n, where n is the number of observations in the original data set, with replacement. This means that some of the observations from the original training set will be repeated and some...
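A minimal R rendering of this procedure, using rpart trees as the base classifier (the base model choice and the data frame df with binary factor column label are assumptions for illustration):

    library(rpart)

    M <- 25
    n <- nrow(df)

    models <- lapply(1:M, function(m) {
      # Step 1: a bootstrap sample of size n, drawn with replacement
      boot <- df[sample(n, n, replace = TRUE), ]
      # Train one version of the same model on this sample
      rpart(label ~ ., data = boot)
    })
    # New observations are then classified by the majority vote shown earlier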

Predicting complex skill learning with boosting


We will revisit our Skillcraft data set in this section—this time in the context of another boosting technique known as stochastic gradient boosting. The main characteristic of this method is that in every iteration of boosting, we compute a gradient in the direction of the errors that are made by the model trained in the current iteration.

This gradient is then used in order to guide the construction of the model that will be added in the next iteration. Stochastic gradient boosting is commonly used with decision trees, and a good implementation in R can be found in the gbm package, which provides us with the gbm() function. For regression problems, we need to specify the distribution parameter to be gaussian. In addition, we can specify the number of trees we want to build (which is equivalent to the number of iterations of boosting) via the n.trees parameter, as well as a shrinkage parameter that is used to control the algorithm's learning...
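Putting these parameters together, a hedged sketch on a hypothetical skillcraft data frame (the LeagueIndex response column and all parameter values are assumptions):

    library(gbm)

    boosted_fit <- gbm(LeagueIndex ~ ., data = skillcraft,
                       distribution = "gaussian",  # squared-error loss for regression
                       n.trees = 10000,            # boosting iterations
                       shrinkage = 0.01)           # the learning rate

    # Pick the best number of trees by the out-of-bag estimate and predict
    best_iter <- gbm.perf(boosted_fit, method = "OOB", plot.it = FALSE)
    preds <- predict(boosted_fit, skillcraft, n.trees = best_iter)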

Random forests


The final ensemble model that we will discuss in this chapter is unique to tree-based models and is known as the random forest. In a nutshell, the idea behind random forests stems from an observation about bagged trees. Let's suppose that the actual relationship between the features and the target variable can be adequately described with a tree structure. It is quite likely that, during bagging with moderately sized bootstrapped samples, we will keep picking the same features to split on high up in the tree.

For example, in our Skillcraft data set, we expect to see APM as the feature that will be chosen at the top of most of the bagged trees. This is a form of tree correlation that essentially impedes our ability to derive the variance reduction benefits from bagging. Put differently, the different tree models that we build are not truly independent of each other because they will have many features and split points in common. Consequently, the averaging process at the end will...
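Random forests counteract this correlation by allowing each split to consider only a random subset of the features. A minimal sketch with the randomForest package, reusing the hypothetical skillcraft data frame from the previous section:

    library(randomForest)

    rf_fit <- randomForest(LeagueIndex ~ ., data = skillcraft,
                           ntree = 500,        # number of bagged trees
                           mtry = floor((ncol(skillcraft) - 1) / 3),  # features per split
                           importance = TRUE)  # record variable importance

    # If APM truly dominates the top splits, it should rank highly here
    importance(rf_fit)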

Summary


In this chapter, we deviated from our usual pattern of learning a new type of model and instead focused on techniques to build ensembles of models that we have seen before. We discovered that there are numerous ways to combine models in a meaningful way, each with its own advantages and limitations. Our first technique for building ensemble models was bagging. The central idea behind bagging is that we build multiple versions of the same model using bootstrap samples of the training data. We then average the predictions made by these models in order to construct our overall prediction. By building many different versions of the model we can smooth out errors made due to overfitting and end up with a model that has reduced variance.

A different approach to building model ensembles uses all of the training data and is known as boosting. Here, the defining characteristic is that we train a sequence of models, each time weighting every observation differently depending on whether...

