
Chapter 7. Supervised Learning with MLlib — Regression

This chapter is divided into the following recipes:

  • Using linear regression
  • Understanding the cost function
  • Doing linear regression with lasso
  • Doing ridge regression

Introduction


The following is Wikipedia's definition of supervised learning:

"Supervised learning is the machine learning task of inferring a function from labeled training data."

There are two types of supervised learning algorithms:

  • Regression: This predicts a continuous-valued output, such as a house price.
  • Classification: This predicts a discrete-valued output called a label, such as whether an e-mail is spam or not. Classification is not limited to two values (binomial); it can have multiple values (multinomial), such as marking an e-mail important, unimportant, or urgent (0, 1, 2, and so on).

We are going to cover regression in this chapter and classification in the next.

We will use recently sold house data from the City of Saratoga, CA, as an example to illustrate the steps of supervised learning in the case of regression:

  1. Get the labeled data:
    • How labeled data is gathered differs in every use case. For example, to convert paper documents into a digital format, documents can...

Using linear regression


Linear regression is an approach to modeling the value of a response or outcome variable, y, based on one or more predictor variables or features, represented by x.
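In standard notation (supplied here for reference; these symbols do not appear in the text above), a linear model with n predictors takes the form

\[ y \approx \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n \]

where \(\theta_0\) is the intercept and \(\theta_1, \ldots, \theta_n\) are the coefficients learned from the training data.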

Getting ready

Let's use some housing data to predict the price of a house based on its size. The following are the sizes and prices of houses in the City of Saratoga, CA, in early 2014:

House size (sq. ft.)    Price
--------------------    ----------
2100                    $1,620,000
2300                    $1,690,000
2046                    $1,400,000
4314                    $2,000,000
1244                    $1,060,000
4608                    $3,830,000
2173                    $1,230,000
2750                    $2,400,000
4010                    $3,380,000
1959                    $1,480,000

Here's a graphical representation of the same data (price plotted against house size):

How to do it...

  1. Start the Spark shell:
$ spark-shell
  2. Import the statistics and related classes:
scala> import org.apache.spark.ml.linalg.Vectors
scala> import org.apache.spark.ml.regression.LinearRegression
  3. Create a DataFrame with the house price as the label:
scala> val points = spark.createDataFrame(Seq(
  (1620000d,Vectors.dense(2100)),
  (1690000d,Vectors.dense(2300)),
  (1400000d,Vectors.dense(2046)),
...
)).toDF("label","features")
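From here, a minimal sketch of fitting and inspecting the model, following the pattern of the ridge recipe later in this chapter (the variable names and the default, unregularized settings are illustrative):

scala> // Fit an ordinary least squares model (regularization is off by default)
scala> val lr = new LinearRegression().setMaxIter(10)
scala> val model = lr.fit(points)
scala> // The fitted line: price is roughly coefficient * size + intercept
scala> println(s"price ~= ${model.coefficients(0)} * size + ${model.intercept}")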

Understanding the cost function


The cost function, or loss function, is a very important concept in machine learning. Most algorithms have some form of cost function, and the goal is to minimize it. Parameters that affect the cost function, such as stepSize, are called hyperparameters; they need to be set by hand. Therefore, understanding the cost function as a whole is very important.

In this recipe, we are going to analyze the cost function in linear regression. Linear regression is a simple algorithm to understand, and it will help you understand the role of cost functions even for complex algorithms.

Let's go back to linear regression. The goal is to find the best-fitting line, that is, the line for which the mean squared error is minimized. Here, error refers to the difference between the value predicted by the best-fitting line and the actual value of the response variable in the training dataset.

For the simple case of a single predictor variable, the best-fitting line...
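For reference, the standard least-squares cost function over m training examples with a single predictor is

\[ J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x_i - y_i \right)^2 \]

and minimizing J with respect to \(\theta_0\) and \(\theta_1\) yields the best-fitting line (the 1/2m scaling is a common convention; some texts use 1/m).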

Doing linear regression with lasso


Lasso is a shrinkage and selection method for linear regression. It minimizes the usual sum of squared errors with an upper bound on the sum of the absolute values of the coefficients. It is based on the original lasso paper found at http://statweb.stanford.edu/~tibs/lasso/lasso.pdf.
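Restating the paper's formulation for reference, with coefficients \(\theta_j\) and bound t, the lasso estimate solves

\[ \min_{\theta} \sum_{i=1}^{m} \left( y_i - \theta_0 - \sum_{j=1}^{n} \theta_j x_{ij} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{n} \lvert \theta_j \rvert \le t \]

Shrinking t forces more coefficients to exactly zero, which is what makes lasso a selection method as well as a shrinkage method.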

The least squares method we used in the previous recipe is also called ordinary least squares (OLS). OLS has two challenges:

  • Prediction accuracy: Predictions made using OLS usually have low forecast bias but high variance. Prediction accuracy can be improved by shrinking some coefficients (or even setting them to zero). There will be some increase in bias, but overall prediction accuracy will improve.
  • Interpretation: When a large number of predictors is available, it is desirable to find a subset of them that exhibits the strongest effect (correlation).

Bias versus variance

There are two primary reasons behind a prediction error: bias and variance. The best way to understand bias...
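In Spark ML, lasso corresponds to the same LinearRegression estimator used in this chapter's other recipes, with elasticNetParam set to 1.0 for a pure L1 penalty. A minimal sketch, assuming a points DataFrame with label and features columns like the ones built earlier (the regParam value is chosen for illustration):

scala> import org.apache.spark.ml.regression.LinearRegression
scala> // elasticNetParam = 1.0 selects the L1 (lasso) penalty; regParam sets its strength
scala> val lasso = new LinearRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(1.0)
scala> val lassoModel = lasso.fit(points)
scala> lassoModel.coefficients  // weak predictors are driven exactly to zero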

Doing ridge regression



An alternative way to improve prediction quality is ridge regression. In lasso, many features get their coefficients set to zero and are therefore eliminated from the equation. In ridge, predictors or features are penalized but are never set to zero.

How to do it...

  1. Start the Spark shell:
$ spark-shell
  2. Import the statistics and related classes:
scala> import org.apache.spark.ml.linalg.Vectors
scala> import org.apache.spark.ml.regression.LinearRegression
  3. Create the dataset with the values we created earlier:
scala> val points = spark.createDataFrame(Seq(
    (1d,Vectors.dense(5,3,1,2,1,3,2,2,1)),
    (2d,Vectors.dense(9,8,8,9,7,9,8,7,9))
)).toDF("label","features")
  4. Initialize the linear regression estimator with elastic net param 0 (which means ridge, or L2, regularization):
scala> val lr = new LinearRegression().setMaxIter(10).setRegParam(.3).setFitIntercept(false).setElasticNetParam(0.0)
  5. Train a model:
scala> val model = lr.fit(points)
  6. Check how many predictors...
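As a sketch of the likely check, assuming the model from the previous step, inspect the fitted coefficients:

scala> // With ridge (L2), all nine coefficients are shrunk toward zero,
scala> // but none are eliminated outright, unlike with lasso
scala> model.coefficients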