You're reading from Spark Cookbook
The following is Wikipedia's definition of machine learning:
"Machine learning is a scientific discipline that explores the construction and study of algorithms that can learn from data."
Essentially, machine learning is making use of past data to make predictions about the future. Machine learning heavily depends upon statistical analysis and methodology.
In statistics, there are four types of measurement scales: nominal, ordinal, interval, and ratio.
Before understanding vectors, let's focus on what a point is. A point is just a set of numbers. This set of numbers, or coordinates, defines the point's position in space. The number of coordinates determines the dimensionality of the space.
We can visualize space with up to three dimensions. Space with more than three dimensions is called hyperspace. Let's put this spatial metaphor to use.
Let's start with a person. A person has the following dimensions:
Weight
Height
Age
We are working in three-dimensional space here. Thus, the point (160, 69, 24) would be interpreted as 160 lb in weight, 69 inches in height, and 24 years of age.
Note
Points and vectors are the same thing. Dimensions in vectors are called features. Put another way, a feature is an individual measurable property of a phenomenon being observed.
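In plain Scala (no Spark required), such a point can be sketched as an array of doubles, one entry per feature; the variable names here are illustrative:

```scala
// A point is just an ordered set of coordinates.
// Each coordinate is one feature of a person: weight (lb), height (in), age (yr).
val person: Array[Double] = Array(160.0, 69.0, 24.0)

// The number of coordinates is the dimensionality of the space.
val dimensions = person.length  // 3

// Accessing individual features by position:
val weight = person(0)  // 160.0
val height = person(1)  // 69.0
val age    = person(2)  // 24.0
```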
Spark has local vectors and matrices, as well as distributed matrices. A distributed matrix is backed by one or more RDDs. A local vector has integer-typed, 0-based indices and double values, and is stored...
A labeled point is a local vector (sparse or dense) with an associated label. Labeled data is used in supervised learning to train algorithms. You will learn more about it in the next chapter.
The label is stored as a double value in LabeledPoint. This means that when you have categorical labels, they need to be mapped to double values. The value you assign to a category is immaterial and is only a matter of convenience.
| Type | Label values |
|---|---|
| Binary classification | 0 or 1 |
| Multiclass classification | 0, 1, 2… |
| Regression | Decimal values |
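As a sketch of this mapping idea in plain Scala (the category names are hypothetical), categorical labels can be turned into double values with a simple map:

```scala
// Hypothetical binary labels: the particular doubles chosen are arbitrary,
// as long as the same mapping is applied consistently.
val labelMap = Map("no" -> 0.0, "yes" -> 1.0)

val rawLabels = Seq("yes", "no", "yes")
val numericLabels = rawLabels.map(labelMap)   // Seq(1.0, 0.0, 1.0)

// For multiclass labels, extend the mapping to 0.0, 1.0, 2.0, ...
val colorMap = Seq("red", "green", "blue").zipWithIndex
  .map { case (c, i) => c -> i.toDouble }
  .toMap                                      // red -> 0.0, green -> 1.0, blue -> 2.0
```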
Start the Spark shell:
$ spark-shell
Import the MLlib vector explicitly:
scala> import org.apache.spark.mllib.linalg.{Vectors,Vector}
Import LabeledPoint:
scala> import org.apache.spark.mllib.regression.LabeledPoint
Create a labeled point with a positive label and dense vector:
scala> val willBuySUV = LabeledPoint(1.0,Vectors.dense(300.0,80,40))
A matrix is simply a table that represents multiple feature vectors. A matrix that can be stored on one machine is called a local matrix, and one that can be distributed across the cluster is called a distributed matrix.
Local matrices have integer-based indices, while distributed matrices have long-based indices. Both have values as doubles.
There are three types of distributed matrices: RowMatrix, IndexedRowMatrix, and CoordinateMatrix.
Start the Spark shell:
$ spark-shell
Import the matrix-related classes:
scala> import org.apache.spark.mllib.linalg.{Vectors,Matrix, Matrices}
Create a dense local matrix:
scala> val people = Matrices.dense(3,2,Array(150d,60d,25d, 300d,80d,40d))
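Matrices.dense expects its values in column-major order, which is why the two people appear as the two columns of the matrix above. A plain-Scala sketch of that indexing rule (no Spark needed; this mirrors the standard column-major convention):

```scala
// Column-major storage: entry (i, j) of an nRows x nCols matrix
// lives at index i + j * nRows in the backing array.
val nRows = 3
val nCols = 2
val values = Array(150d, 60d, 25d, 300d, 80d, 40d)

def entry(i: Int, j: Int): Double = values(i + j * nRows)

// Column 0 is the first person (150, 60, 25); column 1 the second (300, 80, 40).
val firstPersonWeight  = entry(0, 0)  // 150.0
val secondPersonWeight = entry(0, 1)  // 300.0
```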
Create a personRDD as an RDD of vectors:
scala> val personRDD = sc.parallelize(List(Vectors.dense(150,60,25), Vectors.dense(300,80,40)))
Summary statistics are used to summarize observations to get a collective sense of the data. The summary includes the following:
Central tendency of data—mean, mode, median
Spread of data—variance, standard deviation
Boundary conditions—min, max
This recipe covers how to produce summary statistics.
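To make these quantities concrete before turning to Spark, here is a plain-Scala sketch for a single column of observations; Spark's colStats computes the same quantities per column (assuming, as the API docs suggest, the unbiased sample variance with an n − 1 denominator):

```scala
// One column of observations (the weights from the personRDD example).
val column = Seq(150.0, 300.0)

val n = column.length
val mean = column.sum / n                                              // central tendency
val variance = column.map(x => (x - mean) * (x - mean)).sum / (n - 1)  // spread (sample variance)
val stdDev = math.sqrt(variance)
val minValue = column.min                                              // boundary conditions
val maxValue = column.max
```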
Start the Spark shell:
$ spark-shell
Import the matrix-related classes:
scala> import org.apache.spark.mllib.linalg.{Vectors, Vector}
scala> import org.apache.spark.mllib.stat.Statistics
Create a personRDD as an RDD of vectors:
scala> val personRDD = sc.parallelize(List(Vectors.dense(150,60,25), Vectors.dense(300,80,40)))
Compute the column summary statistics:
scala> val summary = Statistics.colStats(personRDD)
Print the mean of this summary:
scala> print(summary.mean)
Print the variance:
scala> print(summary.variance)
Print the non-zero values in each column:
scala> print(summary.numNonzeros)
Print the sample size:
scala> print(summary...
Correlation is a statistical relationship between two variables such that a change in one variable is associated with a change in the other. Correlation analysis measures the extent to which the two variables are correlated.
If an increase in one variable is associated with an increase in the other, it is called a positive correlation. If an increase in one variable is associated with a decrease in the other, it is a negative correlation.
Spark supports two correlation algorithms: Pearson and Spearman. The Pearson algorithm measures the linear relationship between two continuous variables, such as a person's height and weight, or house size and house price. The Spearman algorithm is rank-based: it measures how well the relationship between two variables can be described by a monotonic function, and is appropriate when a variable is ordinal or the relationship is non-linear.
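As a plain-Scala sketch of what the Pearson computation does (not Spark's implementation; Spark's Statistics.corr produces the same quantity per pair of columns):

```scala
// Pearson's r measures the linear relationship between two samples:
// r = cov(x, y) / (stddev(x) * stddev(y))
def pearson(x: Seq[Double], y: Seq[Double]): Double = {
  val n = x.length
  val mx = x.sum / n
  val my = y.sum / n
  val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
  val sx = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
  val sy = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
  cov / (sx * sy)
}

// A perfectly linear relationship yields r = 1; real data such as
// house size versus price would come out strongly positive but below 1.
val r = pearson(Seq(1.0, 2.0, 3.0), Seq(2.0, 4.0, 6.0))
```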
Let's use some real data so that we can calculate correlation more meaningfully. The following are the size and price of houses in the City of Saratoga, California, in early 2014:
| House size (sq ft) | Price |
|---|---|
| 2100 | $1,620,000 |
| 2300 | $1,690,000 |
| 2046 | $1... |
| Party | Male | Female |
|---|---|---|
| Democratic Party | 32 | 41 |
| Republican Party | 28 | 25 |
| Independent | 34 | 26 |
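A contingency table like this is the input to a chi-squared test of independence. A plain-Scala sketch of the statistic, using the counts above (expected count = rowTotal × colTotal / grandTotal; the test itself would then compare this statistic against a chi-squared distribution):

```scala
// Observed counts: rows = parties, columns = (male, female).
val observed = Array(
  Array(32.0, 41.0),  // Democratic Party
  Array(28.0, 25.0),  // Republican Party
  Array(34.0, 26.0)   // Independent
)

val rowTotals  = observed.map(_.sum)
val colTotals  = observed.transpose.map(_.sum)
val grandTotal = rowTotals.sum  // 186.0

// Chi-squared statistic: sum over cells of (O - E)^2 / E,
// where E = rowTotal * colTotal / grandTotal.
val chiSq = (for {
  i <- observed.indices
  j <- observed(i).indices
  expected = rowTotals(i) * colTotals(j) / grandTotal
} yield math.pow(observed(i)(j) - expected, 2) / expected).sum
```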
Spark ML is a new library in Spark for building machine learning pipelines. It is being developed alongside MLlib. It helps combine multiple machine learning algorithms into a single pipeline, and uses DataFrame as the dataset.
Let's first understand some of the basic concepts in Spark ML. It uses transformers to transform one DataFrame into another. A simple example of a transformation is appending a column; you can think of it as the equivalent of "alter table" in the relational world.
An estimator, on the other hand, represents a machine learning algorithm that learns from the data. The input to an estimator is a DataFrame and the output is a transformer. Every estimator has a fit() method, which does the job of training the algorithm.
A machine learning pipeline is defined as a sequence of stages; each stage can be either an estimator or a transformer.
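The transformer/estimator contract can be sketched in plain Scala (no Spark; the names and the toy "dataset" are illustrative stand-ins for DataFrames): a transformer maps a dataset to a dataset, an estimator's fit() learns from the data and returns a transformer, and a pipeline is just such stages applied in order.

```scala
// Toy "dataset": a sequence of doubles standing in for a DataFrame column.
type Dataset = Seq[Double]

// A transformer turns one dataset into another (like appending/altering a column).
trait Transformer { def transform(data: Dataset): Dataset }

// An estimator learns from data and produces a transformer.
trait Estimator { def fit(data: Dataset): Transformer }

// Example estimator: learns the mean, then yields a mean-centering transformer.
object MeanCenterer extends Estimator {
  def fit(data: Dataset): Transformer = {
    val mean = data.sum / data.length  // the "training" step
    new Transformer {
      def transform(d: Dataset): Dataset = d.map(_ - mean)
    }
  }
}

val data: Dataset = Seq(1.0, 2.0, 3.0)
val centerer = MeanCenterer.fit(data)    // estimator -> transformer
val centered = centerer.transform(data)  // Seq(-1.0, 0.0, 1.0)
```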
The example we are going to use in this recipe is whether someone is...