You're reading from Spark Cookbook
The following is Wikipedia's definition of machine learning:
"Machine learning is a scientific discipline that explores the construction and study of algorithms that can learn from data."
Essentially, machine learning is making use of past data to make predictions about the future. Machine learning heavily depends upon statistical analysis and methodology.
In statistics, there are four types of measurement scales: nominal, ordinal, interval, and ratio.
Before understanding vectors, let's focus on what a point is. A point is just a set of numbers. This set of numbers, or coordinates, defines the point's position in space. The number of coordinates determines the dimensionality of the space.
We can visualize space with up to three dimensions. Space with more than three dimensions is called hyperspace. Let's put this spatial metaphor to use.
Let's start with a person. A person has the following dimensions:
Weight
Height
Age
We are working in three-dimensional space here. Thus, the point (160, 69, 24) would be interpreted as 160 lb in weight, 69 inches in height, and 24 years of age.
Note
Points and vectors are the same thing. Dimensions in vectors are called features. Put another way, a feature is an individual measurable property of a phenomenon being observed.
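In plain Scala (no Spark required), such a point can be sketched as an array of doubles, one entry per feature; the variable names here are illustrative:

```scala
// A point is just an ordered set of coordinates.
// Each coordinate is one feature of a person: weight (lb), height (in), age (yr).
val person: Array[Double] = Array(160.0, 69.0, 24.0)

// The number of coordinates is the dimensionality of the space.
val dimensions = person.length  // 3

// Accessing individual features by position:
val weight = person(0)  // 160.0
val height = person(1)  // 69.0
val age    = person(2)  // 24.0
```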
Spark has local vectors and matrices, as well as distributed matrices. A distributed matrix is backed by one or more RDDs. A local vector has integer-typed, 0-based indices and double values, and is stored...
A labeled point is a local vector (sparse or dense) with an associated label. Labeled data is used in supervised learning to train algorithms. You will learn more about it in the next chapter.
The label is stored as a double value in LabeledPoint. This means that when you have categorical labels, they need to be mapped to double values. The value you assign to a category is immaterial and is only a matter of convenience.
| Type | Label values |
|---|---|
| Binary classification | 0 or 1 |
| Multiclass classification | 0, 1, 2… |
| Regression | Decimal values |
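As a sketch of this mapping idea in plain Scala (the category names are hypothetical), categorical labels can be turned into double values with a simple map:

```scala
// Hypothetical binary labels: the particular doubles chosen are arbitrary,
// as long as the same mapping is applied consistently.
val labelMap = Map("no" -> 0.0, "yes" -> 1.0)

val rawLabels = Seq("yes", "no", "yes")
val numericLabels = rawLabels.map(labelMap)   // Seq(1.0, 0.0, 1.0)

// For multiclass labels, extend the mapping to 0.0, 1.0, 2.0, ...
val colorMap = Seq("red", "green", "blue").zipWithIndex
  .map { case (c, i) => c -> i.toDouble }
  .toMap                                      // red -> 0.0, green -> 1.0, blue -> 2.0
```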
Start the Spark shell:
$ spark-shell
Import the MLlib vector explicitly:
scala> import org.apache.spark.mllib.linalg.{Vectors,Vector}
Import LabeledPoint:
scala> import org.apache.spark.mllib.regression.LabeledPoint
Create a labeled point with a positive label and dense vector:
scala> val willBuySUV = LabeledPoint(1.0,Vectors.dense(300.0,80,40))
A matrix is simply a table that represents multiple feature vectors. A matrix that can be stored on one machine is called a local matrix, and one that can be distributed across the cluster is called a distributed matrix.
Local matrices have integer-based indices, while distributed matrices have long-based indices. Both have values as doubles.
There are three types of distributed matrices: RowMatrix, IndexedRowMatrix, and CoordinateMatrix.
Start the Spark shell:
$ spark-shell
Import the matrix-related classes:
scala> import org.apache.spark.mllib.linalg.{Vectors,Matrix, Matrices}
Create a dense local matrix:
scala> val people = Matrices.dense(3,2,Array(150d,60d,25d, 300d,80d,40d))
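Matrices.dense expects its values in column-major order, which is why the two people appear as the two columns of the matrix above. A plain-Scala sketch of that indexing rule (no Spark needed; this mirrors the standard column-major convention):

```scala
// Column-major storage: entry (i, j) of an nRows x nCols matrix
// lives at index i + j * nRows in the backing array.
val nRows = 3
val nCols = 2
val values = Array(150d, 60d, 25d, 300d, 80d, 40d)

def entry(i: Int, j: Int): Double = values(i + j * nRows)

// Column 0 is the first person (150, 60, 25); column 1 the second (300, 80, 40).
val firstPersonWeight  = entry(0, 0)  // 150.0
val secondPersonWeight = entry(0, 1)  // 300.0
```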
Create a personRDD as an RDD of vectors:
scala> val personRDD = sc.parallelize(List(Vectors.dense(150,60,25), Vectors.dense(300,80,40)))
Summary statistics are used to summarize observations to get a collective sense of the data. The summary includes the following:
Central tendency of data—mean, mode, median
Spread of data—variance, standard deviation
Boundary conditions—min, max
This recipe covers how to produce summary statistics.
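To make these quantities concrete before turning to Spark, here is a plain-Scala sketch for a single column of observations; Spark's colStats computes the same quantities per column (assuming, as the API docs suggest, the unbiased sample variance with an n − 1 denominator):

```scala
// One column of observations (the weights from the personRDD example).
val column = Seq(150.0, 300.0)

val n = column.length
val mean = column.sum / n                                              // central tendency
val variance = column.map(x => (x - mean) * (x - mean)).sum / (n - 1)  // spread (sample variance)
val stdDev = math.sqrt(variance)
val minValue = column.min                                              // boundary conditions
val maxValue = column.max
```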
Start the Spark shell:
$ spark-shell
Import the matrix-related classes:
scala> import org.apache.spark.mllib.linalg.{Vectors, Vector}
scala> import org.apache.spark.mllib.stat.Statistics
Create a personRDD as an RDD of vectors:
scala> val personRDD = sc.parallelize(List(Vectors.dense(150,60,25), Vectors.dense(300,80,40)))
Compute the column summary statistics:
scala> val summary = Statistics.colStats(personRDD)
Print the mean of this summary:
scala> print(summary.mean)
Print the variance:
scala> print(summary.variance)
Print the non-zero values in each column:
scala> print(summary.numNonzeros)
Print the sample size:
scala> print(summary...
Correlation is a statistical relationship between two variables such that a change in one variable is associated with a change in the other. Correlation analysis measures the extent to which the two variables are correlated.
If an increase in one variable is associated with an increase in the other, it is called a positive correlation. If an increase in one variable is associated with a decrease in the other, it is a negative correlation.
Spark supports two correlation algorithms: Pearson and Spearman. The Pearson algorithm measures the linear relationship between two continuous variables, such as a person's height and weight, or house size and house price. The Spearman algorithm is rank-based: it measures how well the relationship between two variables can be described by a monotonic function, and is appropriate when a variable is ordinal or the relationship is non-linear.
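As a plain-Scala sketch of what the Pearson computation does (not Spark's implementation; Spark's Statistics.corr produces the same quantity per pair of columns):

```scala
// Pearson's r measures the linear relationship between two samples:
// r = cov(x, y) / (stddev(x) * stddev(y))
def pearson(x: Seq[Double], y: Seq[Double]): Double = {
  val n = x.length
  val mx = x.sum / n
  val my = y.sum / n
  val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
  val sx = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
  val sy = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
  cov / (sx * sy)
}

// A perfectly linear relationship yields r = 1; real data such as
// house size versus price would come out strongly positive but below 1.
val r = pearson(Seq(1.0, 2.0, 3.0), Seq(2.0, 4.0, 6.0))
```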
Let's use some real data so that we can calculate correlation more meaningfully. The following are the size and price of houses in the City of Saratoga, California, in early 2014:
| House size (sq ft) | Price |
|---|---|
| 2100 | $1,620,000 |
| 2300 | $1,690,000 |
| 2046 | $1... |
| Party | Male | Female |
|---|---|---|
| Democratic Party | 32 | 41 |
| Republican Party | 28 | 25 |
| Independent | 34 | 26 |
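A contingency table like this is the input to a chi-squared test of independence. A plain-Scala sketch of the statistic, using the counts above (expected count = rowTotal × colTotal / grandTotal; the test itself would then compare this statistic against a chi-squared distribution):

```scala
// Observed counts: rows = parties, columns = (male, female).
val observed = Array(
  Array(32.0, 41.0),  // Democratic Party
  Array(28.0, 25.0),  // Republican Party
  Array(34.0, 26.0)   // Independent
)

val rowTotals  = observed.map(_.sum)
val colTotals  = observed.transpose.map(_.sum)
val grandTotal = rowTotals.sum  // 186.0

// Chi-squared statistic: sum over cells of (O - E)^2 / E,
// where E = rowTotal * colTotal / grandTotal.
val chiSq = (for {
  i <- observed.indices
  j <- observed(i).indices
  expected = rowTotals(i) * colTotals(j) / grandTotal
} yield math.pow(observed(i)(j) - expected, 2) / expected).sum
```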
Spark ML is a new library in Spark for building machine learning pipelines. It is being developed alongside MLlib. It helps combine multiple machine learning algorithms into a single pipeline, and uses DataFrame as the dataset.
Let's first understand some of the basic concepts in Spark ML. It uses transformers to transform one DataFrame into another. A simple example of a transformation is appending a column; you can think of it as the equivalent of "alter table" in the relational world.
An estimator, on the other hand, represents a machine learning algorithm that learns from the data. The input to an estimator is a DataFrame and the output is a transformer. Every estimator has a fit() method, which does the job of training the algorithm.
A machine learning pipeline is defined as a sequence of stages; each stage can be either an estimator or a transformer.
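The transformer/estimator contract can be sketched in plain Scala (no Spark; the names and the toy "dataset" are illustrative stand-ins for DataFrames): a transformer maps a dataset to a dataset, an estimator's fit() learns from the data and returns a transformer, and a pipeline is just such stages applied in order.

```scala
// Toy "dataset": a sequence of doubles standing in for a DataFrame column.
type Dataset = Seq[Double]

// A transformer turns one dataset into another (like appending/altering a column).
trait Transformer { def transform(data: Dataset): Dataset }

// An estimator learns from data and produces a transformer.
trait Estimator { def fit(data: Dataset): Transformer }

// Example estimator: learns the mean, then yields a mean-centering transformer.
object MeanCenterer extends Estimator {
  def fit(data: Dataset): Transformer = {
    val mean = data.sum / data.length  // the "training" step
    new Transformer {
      def transform(d: Dataset): Dataset = d.map(_ - mean)
    }
  }
}

val data: Dataset = Seq(1.0, 2.0, 3.0)
val centerer = MeanCenterer.fit(data)    // estimator -> transformer
val centered = centerer.transform(data)  // Seq(-1.0, 0.0, 1.0)
```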
The example we are going to use in this recipe is whether someone is...