Chapter 6. Getting Started with Machine Learning Using MLlib

This chapter is divided into the following recipes:

  • Creating vectors

  • Creating a labeled point

  • Creating matrices

  • Calculating summary statistics

  • Calculating correlation

  • Doing hypothesis testing

  • Creating machine learning pipelines using ML

Introduction


The following is Wikipedia's definition of machine learning:

"Machine learning is a scientific discipline that explores the construction and study of algorithms that can learn from data."

Essentially, machine learning uses past data to make predictions about the future. It depends heavily on statistical analysis and methodology.

In statistics, there are four types of measurement scales:

  • Nominal scale (=, ≠): Identifies categories; values cannot be treated as numbers. Example: male, female.

  • Ordinal scale (=, ≠, <, >): Nominal scale plus a ranking from least important to most important. Example: corporate hierarchy.

  • Interval scale (=, ≠, <, >, +, -): Ordinal scale plus a meaningful distance between observations. Numbers assigned to observations indicate order, and the difference between any two consecutive values is the same, but ratios are not meaningful: a 60° temperature is not double a 30° temperature.

  • Ratio scale (=, ≠, <, >, +, ×, ÷): Interval scale plus meaningful ratios between observations: $20 is twice as costly as $10.

Another...

Creating vectors


Before understanding vectors, let's focus on what a point is. A point is just a set of numbers. This set of numbers, or coordinates, defines the point's position in space. The number of coordinates determines the dimensionality of the space.

We can visualize space with up to three dimensions. Space with more than three dimensions is called hyperspace. Let's put this spatial metaphor to use.

Let's start with a person. A person has the following dimensions:

  • Weight

  • Height

  • Age

We are working in three-dimensional space here. Thus, the point (160, 69, 24) would be interpreted as a weight of 160 lb, a height of 69 inches, and an age of 24 years.

Note

Points and vectors are the same thing. The dimensions of a vector are called features. Put another way, we can define a feature as an individual measurable property of a phenomenon being observed.

Spark has local vectors and matrices, as well as distributed matrices. A distributed matrix is backed by one or more RDDs. A local vector has numeric indices and double values, and is stored...
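The recipe's steps are cut off above. As a minimal sketch of how dense and sparse local vectors are typically created in the Spark shell (the values below reuse the person example; the variable names are illustrative):

    scala> import org.apache.spark.mllib.linalg.{Vector, Vectors}
    // A dense vector stores every value explicitly
    scala> val denseVec = Vectors.dense(160.0, 69.0, 24.0)
    // A sparse vector stores the size, the indices of non-zero entries, and their values
    scala> val sparseVec = Vectors.sparse(3, Array(0, 2), Array(160.0, 24.0))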

Creating a labeled point


A labeled point is a local vector (sparse or dense) that has a label associated with it. Labeled data is used in supervised learning to help train algorithms. You will get to know more about it in the next chapter.

The label is stored as a double value in LabeledPoint. This means that when you have categorical labels, they need to be mapped to double values. The value you assign to a category is immaterial and is only a matter of convenience.

Type                        Label values
Binary classification       0 or 1
Multiclass classification   0, 1, 2, …
Regression                  Decimal values

How to do it…

  1. Start the Spark shell:

    $ spark-shell
    
  2. Import the MLlib vector explicitly:

    scala> import org.apache.spark.mllib.linalg.{Vectors,Vector}
    
  3. Import the LabeledPoint:

    scala> import org.apache.spark.mllib.regression.LabeledPoint
    
  4. Create a labeled point with a positive label and dense vector:

    scala> val willBuySUV = LabeledPoint(1.0,Vectors.dense(300.0,80,40))
    
  5. Create a labeled point with a negative label and dense...
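The last step is cut off above. As a rough sketch (not the book's exact text), a negative label simply uses 0.0, and LabeledPoint also accepts sparse vectors; the variable names here are illustrative:

    scala> val willNotBuySUV = LabeledPoint(0.0,Vectors.dense(150.0,60,25))
    // LabeledPoint works equally well with a sparse feature vector
    scala> val sparsePoint = LabeledPoint(0.0,Vectors.sparse(3,Array(0,1,2),Array(150.0,60.0,25.0)))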

Creating matrices


A matrix is simply a table that represents multiple feature vectors. A matrix that can be stored on one machine is called a local matrix, and one that can be distributed across the cluster is called a distributed matrix.

Local matrices have integer-based indices, while distributed matrices have long-based indices. Both store values as doubles.

There are three types of distributed matrices:

  • RowMatrix: This has each row as a feature vector.

  • IndexedRowMatrix: This is a RowMatrix whose rows also carry indices.

  • CoordinateMatrix: This is simply a matrix of MatrixEntry. A MatrixEntry represents an entry in the matrix represented by its row and column index.

How to do it…

  1. Start the Spark shell:

    $ spark-shell
    
  2. Import the matrix-related classes:

    scala> import org.apache.spark.mllib.linalg.{Vectors, Matrix, Matrices}
    
  3. Create a dense local matrix:

    scala> val people = Matrices.dense(3,2,Array(150d,60d,25d, 300d,80d,40d))
    
  4. Create a personRDD as RDD of vectors:

    scala> val personRDD = sc.parallelize(List(Vectors.dense...

Calculating summary statistics


Summary statistics are used to summarize observations to get a collective sense of the data. The summary includes the following:

  • Central tendency of data—mean, mode, median

  • Spread of data—variance, standard deviation

  • Boundary conditions—min, max

This recipe covers how to produce summary statistics.

How to do it…

  1. Start the Spark shell:

    $ spark-shell
    
  2. Import the matrix-related classes:

    scala> import org.apache.spark.mllib.linalg.{Vectors,Vector}
    scala> import org.apache.spark.mllib.stat.Statistics
    
  3. Create a personRDD as RDD of vectors:

    scala> val personRDD = sc.parallelize(List(Vectors.dense(150,60,25), Vectors.dense(300,80,40)))
    
  4. Compute the column summary statistics:

    scala> val summary = Statistics.colStats(personRDD)
    
  5. Print the mean of this summary:

    scala> print(summary.mean)
    
  6. Print the variance:

    scala> print(summary.variance)
    
  7. Print the non-zero values in each column:

    scala> print(summary.numNonzeros)
    
  8. Print the sample size:

    scala> print(summary...
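Step 8 is cut off above. For completeness, the column summary object also exposes a few other useful statistics; a brief sketch:

    // Sample size (number of rows)
    scala> print(summary.count)
    // Column-wise maximum and minimum
    scala> print(summary.max)
    scala> print(summary.min)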

Calculating correlation


Correlation is a statistical relationship between two variables such that when one variable changes, it leads to a change in the other variable. Correlation analysis measures the extent to which the two variables are correlated.

If an increase in one variable leads to an increase in another, it is called a positive correlation. If an increase in one variable leads to a decrease in the other, it is a negative correlation.

Spark supports two correlation algorithms: Pearson and Spearman. The Pearson algorithm works with two continuous variables, such as a person's height and weight, or house size and house price. The Spearman algorithm deals with one continuous and one categorical variable, for example, ZIP code and house price.

Getting ready

Let's use some real data so that we can calculate correlation more meaningfully. The following are the sizes and prices of houses in the City of Saratoga, California, in early 2014:

House size (sq ft)    Price
2100                  $1,620,000
2300                  $1,690,000
2046                  $1...
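The rest of this recipe is cut off here. As a rough sketch of how the Pearson correlation between house size and price could be computed with MLlib's Statistics API (using only the rows visible above; the variable names are illustrative):

    scala> import org.apache.spark.mllib.stat.Statistics
    // Two RDDs of doubles with the same number of elements
    scala> val houseSizes = sc.parallelize(List(2100.0, 2300.0))
    scala> val housePrices = sc.parallelize(List(1620000.0, 1690000.0))
    // "pearson" is the default method; "spearman" is also supported
    scala> val correlation = Statistics.corr(houseSizes, housePrices, "pearson")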
Doing hypothesis testing


Hypothesis testing is a way of determining the probability that a given hypothesis is true. Let's say sample data suggests that females tend to vote more for the Democratic Party. This may or may not be true for the larger population. What if this pattern exists in the sample data just by chance?

Another way to look at the goal of hypothesis testing is to answer this question: If a sample has a pattern in it, what are the chances of the pattern being there just by chance?

How do we do it? There is a saying that the best way to prove something is to try to disprove it.

The hypothesis to disprove is called the null hypothesis. Hypothesis testing works with categorical data. Let's look at the example of a Gallup poll of party affiliations:

Party               Male    Female
Democratic Party    32      41
Republican Party    28      25
Independent         34      26

How to do it…

  1. Start the Spark shell:

    $ spark-shell
    
  2. Import the relevant classes:

    scala> import org.apache.spark.mllib.stat.Statistics
    scala>...
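The remaining steps are cut off above. As a rough sketch (not the book's exact continuation), a chi-squared test of independence on the party-affiliation table could look like this:

    scala> import org.apache.spark.mllib.linalg.Matrices
    // Observed counts in column-major order (rows: Democratic, Republican, Independent; columns: male, female)
    scala> val observed = Matrices.dense(3, 2, Array(32.0, 28.0, 34.0, 41.0, 25.0, 26.0))
    scala> val result = Statistics.chiSqTest(observed)
    // A small p-value suggests party affiliation and gender are not independent
    scala> println(result.pValue)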

Creating machine learning pipelines using ML


Spark ML is a new library in Spark for building machine learning pipelines. It is being developed alongside MLlib. It helps combine multiple machine learning algorithms into a single pipeline, and it uses DataFrame as the dataset.

Getting ready

Let's first understand some of the basic concepts in Spark ML. It uses transformers to transform one DataFrame into another DataFrame. A simple example of a transformation is appending a column; you can think of it as the equivalent of "alter table" in the relational world.

An estimator, on the other hand, represents a machine learning algorithm that learns from data. The input to an estimator is a DataFrame and the output is a transformer. Every estimator has a fit() method, which does the job of training the algorithm.

A machine learning pipeline is defined as a sequence of stages; each stage can be either an estimator or a transformer.

The example we are going to use in this recipe is whether someone is...
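The recipe's worked example is cut off here. As a generic sketch of how a pipeline is assembled from transformers and an estimator (the dataset, column names, and stages below are illustrative, not the book's example):

    scala> import org.apache.spark.ml.Pipeline
    scala> import org.apache.spark.ml.classification.LogisticRegression
    scala> import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
    scala> import sqlContext.implicits._
    // Illustrative training data: (id, text, label)
    scala> val training = sc.parallelize(Seq((0L, "spark is great", 1.0), (1L, "i like pizza", 0.0))).toDF("id", "text", "label")
    // Transformers: split text into words, then hash the words into feature vectors
    scala> val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    scala> val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    // Estimator: learns a logistic regression model from "features" and "label"
    scala> val lr = new LogisticRegression().setMaxIter(10)
    // A pipeline is a sequence of stages; fit() trains it and returns a PipelineModel (itself a transformer)
    scala> val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    scala> val model = pipeline.fit(training)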
