Mastering Predictive Analytics with R



Chapter 5. Support Vector Machines

In this chapter, we are going to take a fresh look at nonlinear predictive models by introducing support vector machines. Support vector machines, often abbreviated as SVMs, are very commonly used for classification problems, although there are certainly ways to perform function approximation and regression tasks with them. In this chapter, we will focus on the more typical case of their role in classification. To do this, we'll first present the notion of maximal margin classification, which offers an alternative way of choosing among the many possible classification boundaries and differs from approaches, such as maximum likelihood, that we have seen thus far. We'll introduce the related idea of support vectors and show how, together with maximal margin classification, we can obtain a linear model in the form of a support vector classifier. Finally, we'll present how we can generalize these ideas in order to introduce nonlinearity through the...

Maximal margin classification


We'll begin this chapter by returning to a situation that should be very familiar by now: the binary classification task. Once again, we'll be thinking about how to design a model that will correctly predict whether an observation belongs to one of two possible classes. We've already seen that this task is simplest when the two classes are linearly separable, that is, when we can find a separating hyperplane (a flat surface whose dimension is one less than that of our feature space) such that all the observations on one side of the hyperplane belong to one class and all the observations on the other side belong to the second class. Depending on the structure, assumptions, and optimization criterion that our particular model uses, we could end up with any one of infinitely many such hyperplanes.

Let's visualize this scenario using some data in a two-dimensional feature space, where the separating hyperplane is just a separating line:

In the preceding diagram...
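A minimal sketch of this kind of plot in R, using simulated data and e1071's svm() with a linear kernel (the simulated data and the very large cost value are illustrative assumptions; svm() actually fits the soft margin classifier introduced in the next section, but with a very large cost it closely approximates the maximal margin solution):

> library(e1071)
> set.seed(7)
> # Two well-separated classes in a two-dimensional feature space
> x <- matrix(rnorm(40, sd = 0.5), ncol = 2)
> y <- factor(rep(c("A", "B"), each = 10))
> x[y == "B", ] <- x[y == "B", ] + 3
> sep_df <- data.frame(x1 = x[, 1], x2 = x[, 2], y = y)
> # A very large cost leaves essentially no budget for margin violations
> model_mm <- svm(y ~ ., data = sep_df, kernel = "linear", cost = 1e5, scale = FALSE)
> # Plot the fitted decision boundary and mark the support vectors
> plot(model_mm, sep_df)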

Support vector classification


We need our data to be linearly separable in order to classify with a maximal margin classifier. When our data is not linearly separable, we can still use the notion of support vectors that define a margin, but this time, we will allow some examples to be misclassified. Thus, we essentially define a soft margin, in that some of the observations in our data set are allowed to violate the constraint that they must lie at least a margin's distance away from the separating hyperplane. It is also important to note that sometimes, we may want to use a soft margin even for linearly separable data. The reason for this is to limit the degree to which we overfit the data. Note that the larger the margin, the more confident we are about our ability to correctly classify new observations, because the classes are further apart from each other in our training data. If we achieve separation using a very small margin, we are less confident about our ability to correctly classify our...
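In e1071, the severity of the penalty for margin violations is controlled by the cost parameter of svm(): a small cost produces a wide, soft margin that tolerates violations, whereas a large cost produces a narrower, stricter margin. The following sketch reuses the sep_df data frame simulated in the previous section; the specific cost values are illustrative assumptions:

> # Small cost: a wide, soft margin, typically involving many support vectors
> model_soft <- svm(y ~ ., data = sep_df, kernel = "linear", cost = 0.01, scale = FALSE)
> model_soft$tot.nSV
> # Large cost: a narrow, strict margin, typically involving few support vectors
> model_strict <- svm(y ~ ., data = sep_df, kernel = "linear", cost = 100, scale = FALSE)
> model_strict$tot.nSV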

Kernels and support vector machines


So far, we've introduced the notion of maximal margin classification under linearly separable conditions and its extension to the support vector classifier, which still uses a hyperplane as the separating boundary but handles data sets that are not linearly separable by specifying a budget for tolerating errors. The observations that lie on or within the margin, or that are misclassified by the support vector classifier, are the support vectors. The critical role that these play in positioning the decision boundary was also seen in an alternative representation of the support vector classifier that uses inner products.
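For reference, in standard notation (which may differ from the symbols used elsewhere in the book), this inner product representation can be written as:

f(x) = \beta_0 + \sum_{i \in S} \alpha_i \langle x, x_i \rangle

Here, S indexes the support vectors, the x_i are the support vectors themselves, and the coefficients \alpha_i are nonzero only for observations in S, so the prediction depends on the data solely through inner products with the support vectors.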

What is common to the situations that we have seen so far in this chapter is that our model is always linear in terms of the input features. We've seen that models which implement nonlinear boundaries between the classes to be separated are far more flexible in terms of the different kinds of underlying target functions...
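As a preview of where this is heading in practice, a nonlinear decision boundary can be obtained simply by choosing a different kernel when calling e1071's svm(). The simulated circular data below and the parameter values are illustrative assumptions:

> library(e1071)
> set.seed(1)
> # Simulated data: one class inside a circle, the other outside (not linearly separable)
> x <- matrix(rnorm(400), ncol = 2)
> y <- factor(ifelse(x[, 1] ^ 2 + x[, 2] ^ 2 > 1.5, "outer", "inner"))
> circle_df <- data.frame(x1 = x[, 1], x2 = x[, 2], y = y)
> # The radial (RBF) kernel computes K(u, v) = exp(-gamma * ||u - v||^2)
> model_radial <- svm(y ~ ., data = circle_df, kernel = "radial", gamma = 1, cost = 1)
> plot(model_radial, circle_df)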

Predicting chemical biodegradation


In this section, we are going to use R's e1071 package to try out the models we've discussed on a real-world data set. As our first example, we have chosen the QSAR biodegradation data set, which can be found at https://archive.ics.uci.edu/ml/datasets/QSAR+biodegradation#. This is a data set containing 41 numerical variables that describe the molecular composition and properties of 1055 chemicals. The modeling task is to predict whether a particular chemical will be biodegradable based on these properties. Example properties are the percentages of carbon, nitrogen, and oxygen atoms, as well as the number of heavy atoms in the molecule. These features are highly specialized and sufficiently numerous that a full listing won't be given here. The complete list and further details of the quantities involved can be found on the website. For now, we've downloaded the data into a bdf data frame:

> bdf <- read.table("biodeg.csv", sep = ";", quote = "\"")
> head(bdf)
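From here, a minimal sketch of how a first model might be fit with e1071 is shown below. The use of V42 as the name of the class column, the 80/20 train/test split, and the cost and gamma values are assumptions for illustration, not necessarily the exact workflow used in the book:

> library(e1071)
> # The last column of the file holds the class label (biodegradable or not)
> bdf$V42 <- factor(bdf$V42)
> set.seed(23)
> train_idx <- sample(nrow(bdf), round(0.8 * nrow(bdf)))
> bdf_train <- bdf[train_idx, ]
> bdf_test <- bdf[-train_idx, ]
> model_svm <- svm(V42 ~ ., data = bdf_train, kernel = "radial", cost = 1, gamma = 0.02)
> test_predictions <- predict(model_svm, bdf_test)
> mean(test_predictions == bdf_test$V42)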

Cross-validation


Many times in the real world, we come across situations where we don't have a separate test data set available to measure the performance of our model on unseen data. The most typical reason is that we have very little data overall and want to use all of it to train our model. Another situation is that we want to keep a sample of the data as a validation set to tune model metaparameters, such as cost and gamma for SVMs with radial kernels, and as a result, we've already reduced our starting data and don't want to reduce it further.

Whatever the reason for the lack of a test data set, we already know that we should never use our training data as a measure of model performance and generalization because of the problem of overfitting. This is especially relevant for powerful and expressive models, such as neural networks and SVMs with radial kernels, which are often capable of approximating the training data very closely...
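In e1071, k-fold cross-validation over candidate values of cost and gamma can be carried out with the tune() function. A minimal sketch is shown below, reusing the bdf_train data frame from the previous section's sketch; the grid of candidate values is an illustrative assumption:

> # Grid search over cost and gamma using 10-fold cross-validation
> tuned <- tune(svm, V42 ~ ., data = bdf_train, kernel = "radial",
+               ranges = list(cost = c(0.01, 0.1, 1, 10, 100),
+                             gamma = c(0.01, 0.05, 0.1)),
+               tunecontrol = tune.control(cross = 10))
> tuned$best.parameters
> tuned$best.performance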

Predicting credit scores


In this section, we will explore another data set, this time from the field of banking and finance. The particular data set in question is known as the German Credit Dataset and is also hosted by the UCI Machine Learning Repository. The link to the data is https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29.

The observations in the data set are loan applications made by individuals at a bank. The modeling goal is to determine whether an application constitutes a high credit risk.
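A minimal sketch of how the raw file might be loaded is given below. The file name german.data, the coding of the class column (1 for a good credit risk, 2 for a bad one), and the column names assigned to the first few attributes follow the UCI documentation and should be treated as assumptions rather than the book's exact code:

> german <- read.table("german.data", header = FALSE)
> dim(german)
> # The twenty-first column is the class: 1 = good credit risk, 2 = bad credit risk
> german$V21 <- factor(german$V21, levels = c(1, 2), labels = c("Good", "Bad"))
> # Assign descriptive names to the first few columns
> names(german)[1:8] <- c("checking", "duration", "creditHistory", "purpose",
+                         "credit", "savings", "employment", "installmentRate")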

Multiclass classification with support vector machines


Just as with logistic regression, we've seen that the basic premise behind the support vector machine is that it is designed to handle two classes. Of course, we often have situations where we would like to be able to handle a greater number of classes, such as when classifying different plant species based on a variety of physical characteristics. One way to do this is the one versus all approach. Here, if we have K classes, we create K SVM classifiers, and with each classifier, we attempt to distinguish one particular class from all the rest. To determine the best class to pick, we assign the class whose classifier places the observation farthest on its own side of the separating hyperplane. More formally, we pick the class for which our linear feature combination has the maximum value across all the different classifiers.
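To make the idea concrete, here is a minimal one versus all sketch using e1071's svm() on the built-in iris data; the choice of data set, the linear kernel, and the evaluation on the training data itself are assumptions made purely for illustration:

> library(e1071)
> classes <- levels(iris$Species)
> # Train one binary SVM per class: the class of interest versus everything else
> ova_models <- lapply(classes, function(cl) {
+   y <- factor(ifelse(iris$Species == cl, cl, "other"))
+   svm(x = iris[, 1:4], y = y, kernel = "linear")
+ })
> # Signed distance of each observation from every classifier's hyperplane,
> # oriented so that larger values favor the class of interest
> ova_scores <- sapply(seq_along(classes), function(i) {
+   dv <- attr(predict(ova_models[[i]], iris[, 1:4], decision.values = TRUE),
+              "decision.values")
+   if (startsWith(colnames(dv)[1], classes[i])) dv[, 1] else -dv[, 1]
+ })
> # Pick the class with the largest score for each observation
> predicted <- classes[apply(ova_scores, 1, which.max)]
> mean(predicted == iris$Species)

Note that e1071's svm() already handles multiclass problems internally using the one versus one scheme described next, so this manual construction serves purely to illustrate the one versus all idea.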

An alternative approach is known as the (balanced) one versus one...

Summary


In this chapter, we presented the maximal margin hyperplane as a decision boundary designed to separate two classes while lying as far as possible from each of them. When the two classes are linearly separable, this creates a situation where the space between the two classes is evenly split.

We've seen that there are circumstances where this is not desirable, such as when the classes are close to each other because of a few observations. An improvement to this approach is the support vector classifier, which allows us to tolerate a few margin violations, or even misclassifications, in order to obtain a more stable result. This also allows us to handle classes that aren't linearly separable. The form of the support vector classifier can be written in terms of inner products between the observation being classified and the support vectors. This transforms our feature space from p features into as many features as we have support vectors. Using kernel functions...


Columns of the German Credit Dataset:

Column name       Type          Definition
checking          Categorical   The status of the existing checking account
duration          Numerical     The duration in months
creditHistory     Categorical   The applicant's credit history
purpose           Categorical   The purpose of the loan
credit            Numerical     The credit amount
savings           Categorical   Savings account/bonds
employment        Categorical   Present employment since
installmentRate   Numerical     The installment rate (as a percentage of disposable income...