
You're reading from Spark Cookbook

Product type: Book
Published in: Jul 2015
ISBN-13: 9781783987061
Edition: 1st
Author: Rishi Yadav

Rishi Yadav has 19 years of experience in designing and developing enterprise applications. He is an open source software expert and advises American companies on big data and public cloud trends. Rishi was honored as one of Silicon Valley's 40 under 40 in 2014. He earned his bachelor's degree from the prestigious Indian Institute of Technology, Delhi, in 1998. About 12 years ago, Rishi started InfoObjects, a company that helps data-driven businesses gain new insights into data. InfoObjects combines the power of open source and big data to solve business challenges for its clients and has a special focus on Apache Spark. The company has been on the Inc. 5000 list of the fastest growing companies for 6 years in a row. InfoObjects was also named the best place to work in the Bay Area in 2014 and 2015. Rishi is an open source contributor and active blogger.

This book is dedicated to my parents, Ganesh and Bhagwati Yadav; I would not be where I am without their unconditional support, trust, and the freedom they gave me to choose a path of my own. Special thanks go to my life partner, Anjali, for providing immense support and putting up with my long, arduous hours (yet again). Our 9-year-old son, Vedant, and niece, Kashmira, were the unrelenting force behind keeping me and the book on track. Big thanks to InfoObjects' CTO and my business partner, Sudhir Jangir, for providing valuable feedback and also contributing recipes on enterprise security, a topic he is passionate about; to our SVP, Bart Hickenlooper, for taking charge in leading the company to the next level; to Tanmoy Chowdhury and Neeraj Gupta for their valuable advice; to Yogesh Chandani, Animesh Chauhan, and Katie Nelson for running operations skillfully so that I could focus on this book; and to our internal review team (especially Rakesh Chandran) for ironing out the kinks. I would also like to thank Marcel Izumi for, as always, providing creative visuals.
I cannot miss thanking our dog, Sparky, for giving me company on my long nights out. Last but not least, special thanks to our valuable clients, partners, and employees, who have made InfoObjects the best place to work at and, needless to say, an immensely successful organization.
Chapter 8. Supervised Learning with MLlib – Classification

This chapter is divided into the following recipes:

  • Doing classification using logistic regression

  • Doing binary classification using SVM

  • Doing classification using decision trees

  • Doing classification using Random Forests

  • Doing classification using Gradient Boosted Trees

  • Doing classification with Naïve Bayes

Introduction


The classification problem is like the regression problem discussed in the previous chapter, except that the outcome variable y takes only a small number of discrete values. In binary classification, y takes only two values: 0 or 1. You can think of the values that the response variable takes in classification as representing categories.

Doing classification using logistic regression


In classification, the response variable y takes discrete values, as opposed to continuous ones. Some examples are e-mail (spam/non-spam), transactions (safe/fraudulent), and so on.

The y variable can take only two values, 0 or 1:

y ∈ {0, 1}

Here, 0 is referred to as the negative class and 1 as the positive class. Although we call them the positive and negative classes, the choice is purely for convenience; the algorithm is neutral about the assignment.

Linear regression, though it works well for regression tasks, hits a few limitations for classification tasks. These include:

  • The fitting process is very susceptible to outliers

  • There is no guarantee that the hypothesis function h(x) will fit in the range 0 (negative class) to 1 (positive class)

Logistic regression guarantees that h(x) always lies between 0 and 1. Although logistic regression has the word regression in its name, that is something of a misnomer: it is very much a classification algorithm. The bound comes from passing a linear combination of the features through the sigmoid (logistic) function:

h(x) = 1 / (1 + e^(-θᵀx))
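To see why the sigmoid keeps the hypothesis bounded, here is a minimal, self-contained sketch (not MLlib's implementation; the `Sigmoid` object and its parameters are illustrative):

```scala
// Toy logistic hypothesis: a dot product of weights and features,
// squashed through the sigmoid so the output stays within [0, 1].
object Sigmoid {
  def h(theta: Array[Double], x: Array[Double]): Double = {
    val z = theta.zip(x).map { case (t, xi) => t * xi }.sum
    1.0 / (1.0 + math.exp(-z))
  }
}
```

However large or small the linear score z becomes, the output never escapes the 0-to-1 range, which is what makes it usable as a class probability.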

In...

Doing binary classification using SVM


Classification is a technique for putting data into different classes based on its characteristics. For example, an e-commerce company can apply one of two labels, "will buy" or "will not buy", to potential visitors.

This classification is done by feeding the machine learning algorithm some already labeled data, called training data. The challenge is how to mark the boundary between the two classes. Let's take a simple example, as shown in the following figure:

In the preceding case, we assigned gray to the "will not buy" label and black to the "will buy" label. Here, drawing a line between the two classes is as easy as the following:

Is this the best we can do? Not really; let's try to do a better job. The black classifier is not really equidistant from the "will buy" and "will not buy" carts. Let's make a better attempt, like the following:

Now this is looking good. This in fact is what the SVM algorithm does. You can see in the preceding diagram that in fact there are only three carts that...
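The "equidistant" idea above can be made concrete. The following is a toy sketch (not Spark code; the `Margin` object, the line coefficients, and the points are made up for illustration): it computes the distance of each point from a candidate separating line w·x + b = 0, and the margin as the smallest such distance. SVM chooses the line that maximizes this margin.

```scala
// Toy illustration of the SVM margin in two dimensions.
object Margin {
  // Perpendicular distance of point p from the line w1*x + w2*y + b = 0.
  def distance(w: (Double, Double), b: Double, p: (Double, Double)): Double = {
    val (w1, w2) = w
    val (x, y) = p
    math.abs(w1 * x + w2 * y + b) / math.sqrt(w1 * w1 + w2 * w2)
  }

  // The margin is the distance from the line to the closest training point.
  def margin(w: (Double, Double), b: Double, pts: Seq[(Double, Double)]): Double =
    pts.map(distance(w, b, _)).min
}
```

The points closest to the line (the ones that set the minimum) are the support vectors; only they determine where the final boundary sits.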

Doing classification using decision trees


Decision trees are the most intuitive among machine learning algorithms. We use decision trees in daily life all the time.

Decision tree algorithms have a lot of useful features:

  • Easy to understand and interpret

  • Work with both categorical and continuous features

  • Work with missing features

  • Do not require feature scaling

Decision tree algorithms work like an upside-down tree: at every level, an expression involving a feature is evaluated, and the result splits the dataset into two branches. We'll illustrate this with a simple game of dumb charades, which most of us played in college. I thought of an animal and asked my coworker to ask me questions to work out my choice. Here's how her questioning went:

Q1: Is it a big animal?

A: Yes

Q2: Does this animal live more than 40 years?

A: Yes

Q3: Is this animal an elephant?

A: Yes

This is an obviously oversimplified case in which she knew I had postulated an elephant (what else would you guess in a Big Data...
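The questioning above is exactly the chain of tests a decision tree performs; it can be written as nested conditionals (the `Animal` case class, labels, and the 40-year threshold are illustrative, not MLlib code):

```scala
// The charades dialogue as a hand-written decision tree.
case class Animal(big: Boolean, lifespanYears: Int)

def guess(a: Animal): String =
  if (!a.big) "a small animal"                      // Q1: Is it a big animal?
  else if (a.lifespanYears > 40) "an elephant"      // Q2: lives more than 40 years?
  else "a big animal with a shorter lifespan"
```

A learned tree differs only in that the questions and thresholds are chosen automatically to best split the training data, rather than by a clever coworker.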

Doing classification using Random Forests


Sometimes one decision tree is not enough, so a set of decision trees is used to produce more powerful models. These are called ensemble learning algorithms. Ensemble learning algorithms are not limited to using decision trees as base models.

The most popular among the ensemble learning algorithms is Random Forest. In Random Forest, rather than growing one single tree, K trees are grown. Every tree is given a random subset S of training data. To add a twist to it, every tree only uses a subset of features. When it comes to making predictions, a majority vote is done on the trees and that becomes the prediction.

Let's explain this with an example. The goal is to make a prediction for a given person about whether he/she has good credit or bad credit.

To do this, we will provide labeled training data—that is, in this case, a person with features and labels whether he/she has good credit or bad credit. Now we do not want to create feature bias so we will...
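The majority-vote step described above can be sketched in a few lines. This is a toy illustration, not MLlib's Random Forest: each "tree" is just a function from a feature map to a label, and the forest predicts whichever label gets the most votes (the `Forest` object and feature names are made up):

```scala
// Toy Random Forest voting: the forest's prediction is the label
// that the majority of individual trees vote for.
object Forest {
  type Tree = Map[String, Double] => Int

  def predict(trees: Seq[Tree], features: Map[String, Double]): Int =
    trees.map(t => t(features))          // collect one vote per tree
      .groupBy(identity)                 // tally votes per label
      .maxBy { case (_, votes) => votes.size }
      ._1
}
```

In the real algorithm, each tree is trained on its own random subset of rows and features, which is what makes the individual votes diverse enough for the majority to be more reliable than any single tree.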

Doing classification using Gradient Boosted Trees


Another ensemble learning algorithm is Gradient Boosted Trees (GBTs). GBTs train one tree at a time, where each new tree improves upon the shortcomings of previously trained trees.

As GBTs train one tree at a time, they can take longer to train than Random Forests.
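The "each new tree improves upon the shortcomings of the previous ones" idea can be shown with a deliberately tiny sketch. This is not MLlib code: here every "tree" is just a constant equal to the mean residual, and a shrinkage factor `lr` plays the role of the learning rate (all names and values are illustrative):

```scala
// Toy gradient boosting for squared error: each stage fits a correction
// to the residual left by the previous stages and adds a shrunken copy
// of it to the ensemble prediction.
object Boosting {
  def fit(y: Array[Double], stages: Int, lr: Double = 0.5): Array[Double] = {
    var pred = Array.fill(y.length)(0.0)
    for (_ <- 1 to stages) {
      val meanResidual =
        y.zip(pred).map { case (yi, pi) => yi - pi }.sum / y.length
      pred = pred.map(_ + lr * meanResidual)  // each "tree" is a constant here
    }
    pred
  }
}
```

Each stage shrinks the remaining error, which is why the stages must run sequentially, and why boosting cannot be parallelized across trees the way a Random Forest can.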

Getting ready

We are going to use the same data we used in the previous recipe.

How to do it…

  1. Start the Spark shell:

    $ spark-shell
    
  2. Perform the required imports:

    scala> import org.apache.spark.mllib.tree.GradientBoostedTrees
    scala> import org.apache.spark.mllib.tree.configuration.BoostingStrategy
    scala> import org.apache.spark.mllib.util.MLUtils
    
  3. Load and parse the data:

    scala> val data =
      MLUtils.loadLibSVMFile(sc, "rf_libsvm_data.txt")
    
  4. Split the data into training and test datasets:

    scala> val splits = data.randomSplit(Array(0.7, 0.3))
    scala> val (trainingData, testData) = (splits(0), splits(1))
    
  5. Create a classification as a boosting strategy and set the number of iterations...
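Step 5 is cut off above. Assuming the standard MLlib API of this era, the recipe most likely continues along these lines (the number of iterations shown is illustrative):

```scala
scala> val boostingStrategy = BoostingStrategy.defaultParams("Classification")
scala> boostingStrategy.numIterations = 3
scala> val model = GradientBoostedTrees.train(trainingData, boostingStrategy)
```

The trained model can then be evaluated against testData, as in the previous recipe.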

Doing classification with Naïve Bayes


Let's consider building an e-mail spam filter using machine learning. Here we are interested in two classes: spam for unsolicited messages and non-spam for regular e-mails.

The first challenge is how to represent an e-mail as a feature vector x. An e-mail is just a bunch of text, or a collection of words (this problem domain therefore falls into a broader category called text classification). Let's represent an e-mail by a feature vector whose length equals the size of the dictionary: if a given dictionary word appears in the e-mail, the corresponding value is 1; otherwise, it is 0. Let's build the vector representing an e-mail with the content "online pharmacy sale":

The dictionary of words used in this feature vector is called the vocabulary, and the dimension of the vector equals the size of the vocabulary. If the vocabulary size is 10,000, the number of possible values of this feature vector is 2^10,000.
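The encoding just described can be sketched with a tiny illustrative vocabulary in place of a 10,000-word dictionary (the word list and function name are made up for illustration):

```scala
// Bag-of-words encoding: one slot per vocabulary word, set to 1 if the
// word appears in the e-mail and 0 otherwise.
val vocabulary = Vector("buy", "online", "pharmacy", "sale", "meeting")

def toFeatureVector(email: String): Vector[Int] = {
  val wordsInEmail = email.toLowerCase.split("\\s+").toSet
  vocabulary.map(w => if (wordsInEmail.contains(w)) 1 else 0)
}
```

For the e-mail "online pharmacy sale", this produces the vector (0, 1, 1, 1, 0): a 1 in the slots for online, pharmacy, and sale, and 0 elsewhere.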

Our goal is to model the probability of x given...
