You're reading from Spark Cookbook
The following is Wikipedia's definition of supervised learning:
"Supervised learning is the machine learning task of inferring a function from labeled training data."
Supervised learning has two steps:
Train the algorithm with a training dataset; this is like first giving it questions along with their answers
Use a test dataset to ask another set of questions of the trained algorithm
There are two types of supervised learning algorithms:
Regression: This predicts continuous valued output, such as a house price.
Classification: This predicts discrete valued output (0 or 1), called a label, such as whether an e-mail is spam or not. Classification is not limited to two values; it can have multiple values, such as marking an e-mail important, not important, urgent, and so on (0, 1, 2…).
As an example dataset for regression, we will use the recently sold house data of the City of Saratoga, CA, as a training set to train the algorithm...
Linear regression is an approach to model the value of a response variable y based on one or more predictor variables or features x.
Let's use some housing data to predict the price of a house based on its size. The following are the sizes and prices of houses in the City of Saratoga, CA, in early 2014:
House size (sq ft) | Price
---|---
2100 | $1,620,000
2300 | $1,690,000
2046 | $1,400,000
4314 | $2,000,000
1244 | $1,060,000
4608 | $3,830,000
2173 | $1,230,000
2750 | $2,400,000
4010 | $3,380,000
1959 | $1,480,000
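Before turning to Spark, it may help to see what the fit itself amounts to. The following is a plain-Python sketch (not the book's Spark MLlib code) that computes the ordinary-least-squares line for the table above using the standard closed-form slope and intercept formulas:

```python
# Closed-form simple linear regression (OLS) on the Saratoga data above.
# Pure Python illustration; the recipe performs the same kind of fit with Spark MLlib.
sizes = [2100, 2300, 2046, 4314, 1244, 4608, 2173, 2750, 4010, 1959]
prices = [1620000, 1690000, 1400000, 2000000, 1060000,
          3830000, 1230000, 2400000, 3380000, 1480000]

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n

# slope = cov(x, y) / var(x); the intercept makes the line pass through the means
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) \
        / sum((x - mean_x) ** 2 for x in sizes)
intercept = mean_y - slope * mean_x

def predict(size):
    """Predicted price for a house of the given size (sq ft)."""
    return intercept + slope * size

print(round(slope, 2))        # price increase per additional square foot
print(round(predict(2500)))   # predicted price for a 2,500 sq ft house
```

Larger houses should predict higher prices, so the slope comes out positive for this dataset.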
Start the Spark shell:
$ spark-shell
Import the statistics and related classes:
scala> import org.apache.spark.mllib.linalg.Vectors
scala> import org.apache.spark.mllib.regression.LabeledPoint
scala> import org.apache.spark.mllib.regression.LinearRegressionWithSGD
Create the LabeledPoint array with the house price as the label:
scala> val points = Array( LabeledPoint(1620000...
The cost function, or loss function, is a very important concept in machine learning algorithms. Most algorithms have some form of cost function, and the goal is to minimize it. Parameters that affect the cost function, such as stepSize in the last recipe, need to be set by hand. Therefore, understanding the concept of the cost function well is very important.
In this recipe, we are going to analyze the cost function for linear regression. Linear regression is a simple algorithm to understand, and it will help readers understand the role of cost functions even in complex algorithms.
Let's go back to linear regression. The goal is to find the best-fitting line, so that the mean squared error is minimized. Here, error refers to the difference between the value predicted by the best-fitting line and the actual value of the response variable in the training dataset.
For the simple case of a single predictor variable, the best-fit line can be written as:

y = θ0 + θ1x

Here, θ0 is the y intercept and θ1 is the slope.
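To make the mean-squared-error cost concrete, here is a plain-Python sketch (not the book's Spark code) that evaluates the standard cost J(θ0, θ1) for a candidate line against a tiny hand-made training set:

```python
# Mean-squared-error cost for simple linear regression:
#   J(theta0, theta1) = (1 / 2m) * sum((theta0 + theta1 * x_i - y_i) ** 2)
# Gradient descent (what the SGD-based recipes use) tries to minimize this.
def cost(theta0, theta1, xs, ys):
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]  # exactly y = 2x, so the true line is theta0=0, theta1=2

print(cost(0, 2, xs, ys))  # perfect fit -> cost is 0.0
print(cost(0, 1, xs, ys))  # a worse line -> strictly higher cost (3.75)
```

Any line other than the true one yields a strictly larger cost, which is exactly what the minimization exploits.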
The lasso is a shrinkage and selection method for linear regression. It minimizes the usual sum of squared errors, with a bound on the sum of the absolute values of the coefficients. It is based on the original lasso paper found at http://statweb.stanford.edu/~tibs/lasso/lasso.pdf.
The least square method we used in the last recipe is also called ordinary least squares (OLS). OLS has two challenges:
Prediction accuracy: Predictions made using OLS usually have low forecast bias and high variance. Prediction accuracy can be improved by shrinking some coefficients (or even making them zero). There will be some increase in bias, but overall prediction accuracy will improve.
Interpretation: With a large number of predictors, it is desirable to find a subset of them that exhibits the strongest effect (correlation).
An alternative to lasso for improving prediction quality is ridge regression. While in lasso many features get their coefficients set to zero and are therefore eliminated from the equation, in ridge regression predictors (features) are penalized but never set to zero.
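The difference between the two penalties is easiest to see on a single coefficient. For a fixed penalty strength λ, the standard one-dimensional solutions are: ridge rescales (shrinks) the coefficient, while lasso soft-thresholds it and can set it exactly to zero. A plain-Python sketch of these textbook update rules (not Spark's implementation):

```python
# One-dimensional view of the two penalties applied to an OLS coefficient b.
#   ridge: minimize (w - b)^2 + lam * w^2   ->  w = b / (1 + lam)
#   lasso: minimize (w - b)^2 + lam * |w|   ->  soft-thresholding
def ridge_shrink(b, lam):
    # always shrinks toward zero, but never reaches it for b != 0
    return b / (1 + lam)

def lasso_shrink(b, lam):
    # soft-threshold: move b toward zero by lam/2, clamping at zero
    if b > lam / 2:
        return b - lam / 2
    if b < -lam / 2:
        return b + lam / 2
    return 0.0

print(ridge_shrink(0.1, 1.0))  # small, but not exactly zero
print(lasso_shrink(0.1, 1.0))  # small coefficient eliminated: 0.0
```

This is why lasso performs feature selection (zeroed coefficients drop out of the model) while ridge keeps every feature with a reduced weight.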
Start the Spark shell:
$ spark-shell
Import the statistics and related classes:
scala> import org.apache.spark.mllib.linalg.Vectors
scala> import org.apache.spark.mllib.regression.LabeledPoint
scala> import org.apache.spark.mllib.regression.RidgeRegressionWithSGD
Create the LabeledPoint array with the first value as the label:
scala> val points = Array(
         LabeledPoint(1,Vectors.dense(5,3,1,2,1,3,2,2,1)),
         LabeledPoint(2,Vectors.dense(9,8,8,9,7,9,8,7,9))
       )
Create an RDD of the preceding data:
scala> val rdd = sc.parallelize(points)
Train a model with this data using 100 iterations. Here, the step size and regularization parameter have been set by hand:
scala> val model = RidgeRegressionWithSGD...
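For intuition about what RidgeRegressionWithSGD is doing under the hood, here is a minimal plain-Python gradient descent on the ridge objective, with the step size and regularization parameter set by hand as in the recipe. The data and parameter values are hypothetical, and this is a one-feature batch-gradient sketch, not Spark's actual SGD implementation:

```python
# Batch gradient descent on the ridge objective:
#   J(w) = (1/2m) * sum((w * x_i - y_i)^2) + (lam / 2) * w^2
# Single feature, no intercept, to keep the sketch short.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]   # roughly y = 2x

step_size = 0.02            # set by hand, like stepSize in the recipe
lam = 0.1                   # regularization parameter, also hand-tuned
w = 0.0
m = len(xs)

for _ in range(100):        # 100 iterations, as in the recipe
    # gradient of J(w): data term plus the ridge penalty term lam * w
    grad = sum((w * x - y) * x for x, y in zip(xs, ys)) / m + lam * w
    w -= step_size * grad

print(round(w, 2))  # close to, but shrunk slightly below, the OLS slope of ~2
```

Setting lam to zero recovers plain least squares; increasing it shrinks w further toward zero, which is the bias-for-variance trade described above.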