
You're reading from Spark Cookbook

Product type: Book
Published in: Jul 2015
ISBN-13: 9781783987061
Edition: 1st
Author: Rishi Yadav

Rishi Yadav has 19 years of experience in designing and developing enterprise applications. He is an open source software expert and advises American companies on big data and public cloud trends. Rishi was honored as one of Silicon Valley's 40 under 40 in 2014. He earned his bachelor's degree from the prestigious Indian Institute of Technology, Delhi, in 1998. About 12 years ago, Rishi started InfoObjects, a company that helps data-driven businesses gain new insights into data. InfoObjects combines the power of open source and big data to solve business challenges for its clients and has a special focus on Apache Spark. The company has been on the Inc. 5000 list of the fastest growing companies for 6 years in a row. InfoObjects was also named the best place to work in the Bay Area in 2014 and 2015. Rishi is an open source contributor and active blogger.

This book is dedicated to my parents, Ganesh and Bhagwati Yadav; I would not be where I am without their unconditional support, trust, and the freedom they gave me to choose a path of my own. Special thanks go to my life partner, Anjali, for providing immense support and putting up with my long, arduous hours (yet again). Our 9-year-old son, Vedant, and niece, Kashmira, were the unrelenting force behind keeping me and the book on track. Big thanks to InfoObjects' CTO and my business partner, Sudhir Jangir, for providing valuable feedback and also contributing recipes on enterprise security, a topic he is passionate about; to our SVP, Bart Hickenlooper, for taking charge in leading the company to the next level; to Tanmoy Chowdhury and Neeraj Gupta for their valuable advice; to Yogesh Chandani, Animesh Chauhan, and Katie Nelson for running operations skillfully so that I could focus on this book; and to our internal review team (especially Rakesh Chandran) for ironing out the kinks. I would also like to thank Marcel Izumi for, as always, providing creative visuals.
I cannot miss thanking our dog, Sparky, for giving me company on my long nights out. Last but not least, special thanks to our valuable clients, partners, and employees, who have made InfoObjects the best place to work at and, needless to say, an immensely successful organization.
Chapter 8. Supervised Learning with MLlib – Classification

This chapter is divided into the following recipes:

  • Doing classification using logistic regression

  • Doing binary classification using SVM

  • Doing classification using decision trees

  • Doing classification using Random Forests

  • Doing classification using Gradient Boosted Trees

  • Doing classification with Naïve Bayes

Introduction


The classification problem is like the regression problem discussed in the previous chapter, except that the outcome variable y takes only a small number of discrete values. In binary classification, y takes only two values: 0 or 1. You can think of the values that the response variable takes in classification as representing categories.

Doing classification using logistic regression


In classification, the response variable y takes discrete values, as opposed to continuous ones. Some examples are e-mail (spam/non-spam), transactions (safe/fraudulent), and so on.

The y variable can take only two values, 0 or 1:

y ∈ {0, 1}

Here, 0 is referred to as the negative class and 1 as the positive class. Although we call them the positive and negative classes, the choice is purely for convenience; the algorithm is neutral about the assignment.

Linear regression, though it works well for regression tasks, hits a few limitations for classification tasks. These include:

  • The fitting process is very susceptible to outliers

  • There is no guarantee that the hypothesis function h(x) will fit in the range 0 (negative class) to 1 (positive class)

Logistic regression guarantees that h(x) always lies between 0 and 1. Although logistic regression has the word regression in its name, that is something of a misnomer: it is very much a classification algorithm. The bound comes from passing a linear combination of the features through the sigmoid (logistic) function:

h(x) = 1 / (1 + e^(-θᵀx))
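To see why the sigmoid keeps the hypothesis bounded, here is a minimal, self-contained sketch (not MLlib's implementation; the `Sigmoid` object and its parameters are illustrative):

```scala
// Toy logistic hypothesis: a dot product of weights and features,
// squashed through the sigmoid so the output stays within [0, 1].
object Sigmoid {
  def h(theta: Array[Double], x: Array[Double]): Double = {
    val z = theta.zip(x).map { case (t, xi) => t * xi }.sum
    1.0 / (1.0 + math.exp(-z))
  }
}
```

However large or small the linear score z becomes, the output never escapes the 0-to-1 range, which is what makes it usable as a class probability.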

In...

Doing binary classification using SVM


Classification is a technique for putting data into different classes based on its characteristics. For example, an e-commerce company can apply one of two labels, "will buy" or "will not buy", to potential visitors.

This classification is done by feeding the machine learning algorithm some already labeled data, called training data. The challenge is how to mark the boundary between the two classes. Let's take a simple example, as shown in the following figure:

In the preceding case, we assigned gray to the "will not buy" label and black to the "will buy" label. Here, drawing a line between the two classes is as easy as the following:

Is this the best we can do? Not really; let's try to do a better job. The black classifier is not really equidistant from the "will buy" and "will not buy" carts. Let's make a better attempt, like the following:

Now this is looking good. This in fact is what the SVM algorithm does. You can see in the preceding diagram that in fact there are only three carts that...
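The "equidistant" idea above can be made concrete. The following is a toy sketch (not Spark code; the `Margin` object, the line coefficients, and the points are made up for illustration): it computes the distance of each point from a candidate separating line w·x + b = 0, and the margin as the smallest such distance. SVM chooses the line that maximizes this margin.

```scala
// Toy illustration of the SVM margin in two dimensions.
object Margin {
  // Perpendicular distance of point p from the line w1*x + w2*y + b = 0.
  def distance(w: (Double, Double), b: Double, p: (Double, Double)): Double = {
    val (w1, w2) = w
    val (x, y) = p
    math.abs(w1 * x + w2 * y + b) / math.sqrt(w1 * w1 + w2 * w2)
  }

  // The margin is the distance from the line to the closest training point.
  def margin(w: (Double, Double), b: Double, pts: Seq[(Double, Double)]): Double =
    pts.map(distance(w, b, _)).min
}
```

The points closest to the line (the ones that set the minimum) are the support vectors; only they determine where the final boundary sits.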

Doing classification using decision trees


Decision trees are the most intuitive among machine learning algorithms. We use decision trees in daily life all the time.

Decision tree algorithms have a lot of useful features:

  • Easy to understand and interpret

  • Work with both categorical and continuous features

  • Work with missing features

  • Do not require feature scaling

Decision tree algorithms work like an upside-down tree: at every level, an expression involving a feature is evaluated, and the result splits the dataset into two branches. We'll illustrate this with a simple game of dumb charades, which most of us played in college. I thought of an animal and asked my coworker to ask me questions to work out my choice. Here's how her questioning went:

Q1: Is it a big animal?

A: Yes

Q2: Does this animal live more than 40 years?

A: Yes

Q3: Is this animal an elephant?

A: Yes

This is an obviously oversimplified case in which she knew I had postulated an elephant (what else would you guess in a Big Data...
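The questioning above is exactly the chain of tests a decision tree performs; it can be written as nested conditionals (the `Animal` case class, labels, and the 40-year threshold are illustrative, not MLlib code):

```scala
// The charades dialogue as a hand-written decision tree.
case class Animal(big: Boolean, lifespanYears: Int)

def guess(a: Animal): String =
  if (!a.big) "a small animal"                      // Q1: Is it a big animal?
  else if (a.lifespanYears > 40) "an elephant"      // Q2: lives more than 40 years?
  else "a big animal with a shorter lifespan"
```

A learned tree differs only in that the questions and thresholds are chosen automatically to best split the training data, rather than by a clever coworker.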

Doing classification using Random Forests


Sometimes one decision tree is not enough, so a set of decision trees is used to produce more powerful models. These are called ensemble learning algorithms. Ensemble learning algorithms are not limited to using decision trees as base models.

The most popular among the ensemble learning algorithms is Random Forest. In Random Forest, rather than growing one single tree, K trees are grown. Every tree is given a random subset S of training data. To add a twist to it, every tree only uses a subset of features. When it comes to making predictions, a majority vote is done on the trees and that becomes the prediction.

Let's explain this with an example. The goal is to make a prediction for a given person about whether he/she has good credit or bad credit.

To do this, we will provide labeled training data—that is, in this case, a person with features and labels whether he/she has good credit or bad credit. Now we do not want to create feature bias so we will...
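The majority-vote step described above can be sketched in a few lines. This is a toy illustration, not MLlib's Random Forest: each "tree" is just a function from a feature map to a label, and the forest predicts whichever label gets the most votes (the `Forest` object and feature names are made up):

```scala
// Toy Random Forest voting: the forest's prediction is the label
// that the majority of individual trees vote for.
object Forest {
  type Tree = Map[String, Double] => Int

  def predict(trees: Seq[Tree], features: Map[String, Double]): Int =
    trees.map(t => t(features))          // collect one vote per tree
      .groupBy(identity)                 // tally votes per label
      .maxBy { case (_, votes) => votes.size }
      ._1
}
```

In the real algorithm, each tree is trained on its own random subset of rows and features, which is what makes the individual votes diverse enough for the majority to be more reliable than any single tree.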

Doing classification using Gradient Boosted Trees


Another ensemble learning algorithm is Gradient Boosted Trees (GBTs). GBTs train one tree at a time, where each new tree improves upon the shortcomings of previously trained trees.

As GBTs train one tree at a time, they can take longer to train than Random Forests.
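The "each new tree improves upon the shortcomings of the previous ones" idea can be shown with a deliberately tiny sketch. This is not MLlib code: here every "tree" is just a constant equal to the mean residual, and a shrinkage factor `lr` plays the role of the learning rate (all names and values are illustrative):

```scala
// Toy gradient boosting for squared error: each stage fits a correction
// to the residual left by the previous stages and adds a shrunken copy
// of it to the ensemble prediction.
object Boosting {
  def fit(y: Array[Double], stages: Int, lr: Double = 0.5): Array[Double] = {
    var pred = Array.fill(y.length)(0.0)
    for (_ <- 1 to stages) {
      val meanResidual =
        y.zip(pred).map { case (yi, pi) => yi - pi }.sum / y.length
      pred = pred.map(_ + lr * meanResidual)  // each "tree" is a constant here
    }
    pred
  }
}
```

Each stage shrinks the remaining error, which is why the stages must run sequentially, and why boosting cannot be parallelized across trees the way a Random Forest can.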

Getting ready

We are going to use the same data we used in the previous recipe.

How to do it…

  1. Start the Spark shell:

    $ spark-shell
    
  2. Perform the required imports:

    scala> import org.apache.spark.mllib.tree.GradientBoostedTrees
    scala> import org.apache.spark.mllib.tree.configuration.BoostingStrategy
    scala> import org.apache.spark.mllib.util.MLUtils
    
  3. Load and parse the data:

    scala> val data =
      MLUtils.loadLibSVMFile(sc, "rf_libsvm_data.txt")
    
  4. Split the data into training and test datasets:

    scala> val splits = data.randomSplit(Array(0.7, 0.3))
    scala> val (trainingData, testData) = (splits(0), splits(1))
    
  5. Create a classification as a boosting strategy and set the number of iterations...
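Step 5 is cut off above. Assuming the standard MLlib API of this era, the recipe most likely continues along these lines (the number of iterations shown is illustrative):

```scala
scala> val boostingStrategy = BoostingStrategy.defaultParams("Classification")
scala> boostingStrategy.numIterations = 3
scala> val model = GradientBoostedTrees.train(trainingData, boostingStrategy)
```

The trained model can then be evaluated against testData, as in the previous recipe.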

Doing classification with Naïve Bayes


Let's consider building an e-mail spam filter using machine learning. Here we are interested in two classes: spam for unsolicited messages and non-spam for regular e-mails.

The first challenge is how to represent an e-mail as a feature vector x. An e-mail is just a bunch of text, or a collection of words (this problem domain therefore falls into a broader category called text classification). Let's represent an e-mail by a feature vector whose length equals the size of the dictionary: if a given dictionary word appears in the e-mail, the corresponding value is 1; otherwise, it is 0. Let's build the vector representing an e-mail with the content "online pharmacy sale":

The dictionary of words used in this feature vector is called the vocabulary, and the dimension of the vector equals the size of the vocabulary. If the vocabulary size is 10,000, the number of possible values of this feature vector is 2^10,000.
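The encoding just described can be sketched with a tiny illustrative vocabulary in place of a 10,000-word dictionary (the word list and function name are made up for illustration):

```scala
// Bag-of-words encoding: one slot per vocabulary word, set to 1 if the
// word appears in the e-mail and 0 otherwise.
val vocabulary = Vector("buy", "online", "pharmacy", "sale", "meeting")

def toFeatureVector(email: String): Vector[Int] = {
  val wordsInEmail = email.toLowerCase.split("\\s+").toSet
  vocabulary.map(w => if (wordsInEmail.contains(w)) 1 else 0)
}
```

For the e-mail "online pharmacy sale", this produces the vector (0, 1, 1, 1, 0): a 1 in the slots for online, pharmacy, and sale, and 0 elsewhere.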

Our goal is to model the probability of x given...
