You're reading from Machine Learning with Spark - Second Edition

Product type: Book
Published in: Apr 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781785889936
Edition: 2nd Edition
Authors (2):
Rajdeep Dua

Rajdeep Dua has over 18 years of experience in the cloud and big data space. He has taught Spark and big data at some of the most prestigious tech schools in India: IIIT Hyderabad, ISB, IIIT Delhi, and Pune College of Engineering. He currently leads the developer relations team at Salesforce India. He has also presented BigQuery and Google App Engine at the W3C conference in Hyderabad. He led the developer relations teams at Google, VMware, and Microsoft, and has spoken at hundreds of other conferences on the cloud. Other references to his work can be seen at Your Story and in the ACM digital library. His contributions to the open source community relate to Docker, Kubernetes, Android, OpenStack, and Cloud Foundry.

Manpreet Singh Ghotra

Manpreet Singh Ghotra has more than 15 years of experience in software development for both enterprise and big data software. He is currently working at Salesforce on developing a machine learning platform/APIs using open source libraries and frameworks such as Keras, Apache Spark, and TensorFlow. He has worked on various machine learning systems, including sentiment analysis, spam detection, and anomaly detection. He was part of the machine learning group at one of the largest online retailers in the world, working on transit time calculations using Apache Mahout, and the R recommendation system, again using Apache Mahout. With a master's and postgraduate degree in machine learning, he has contributed to, and worked for, the machine learning community.

Math for Machine Learning

A machine learning practitioner needs a fair understanding of machine learning concepts and algorithms, and familiarity with mathematics is an important part of that understanding. Just as we learn to program by understanding the fundamental concepts and constructs of a language, we learn machine learning by understanding its concepts and algorithms through mathematics. Mathematics is used to solve complex computational problems and underpins many computer science concepts. It plays a fundamental role in grasping theoretical concepts and in choosing the right algorithm. This chapter covers the basics of linear algebra and calculus for machine learning.

In this chapter, we will cover the following topics:

  • Linear algebra
  • Environment setup
    • Setting up the Scala environment in IntelliJ
    • Setting up the Scala environment on the command line
  • Fields
  • Vectors...

Linear algebra

Linear algebra is the study of systems of linear equations and of linear transformations. Vectors, matrices, and determinants are its fundamental tools. We will learn each of these in detail using Breeze, the underlying linear algebra library that Spark uses for numerical processing. The corresponding Spark objects are wrappers around Breeze and act as a public interface, which keeps the Spark ML library consistent even if Breeze changes internally.
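To make the roles of vectors, matrices, and determinants concrete, here is a minimal sketch that solves a 2x2 system of linear equations with Cramer's rule. It is written in plain Scala rather than Breeze so that it runs without any dependency; the names Linear2x2Sketch, Vec, Mat, det, mul, and solve are hypothetical and are not part of Breeze or Spark.

```scala
// Illustrative sketch only: plain Scala, no Breeze. All names are hypothetical.
object Linear2x2Sketch {
  // A 2-dimensional vector.
  type Vec = (Double, Double)
  // A 2x2 matrix stored row by row: ((a, b), (c, d)).
  type Mat = (Vec, Vec)

  // Determinant of a 2x2 matrix: a*d - b*c.
  def det(m: Mat): Double = {
    val ((a, b), (c, d)) = m
    a * d - b * c
  }

  // Matrix-vector product, that is, the linear transformation m applied to v.
  def mul(m: Mat, v: Vec): Vec = {
    val ((a, b), (c, d)) = m
    (a * v._1 + b * v._2, c * v._1 + d * v._2)
  }

  // Solve m * x = rhs with Cramer's rule.
  // Returns None when the determinant is zero (no unique solution).
  def solve(m: Mat, rhs: Vec): Option[Vec] = {
    val d = det(m)
    if (d == 0.0) None
    else {
      val ((a, b), (c, e)) = m
      // Replace one column at a time by rhs and divide by the determinant.
      val x = det(((rhs._1, b), (rhs._2, e))) / d
      val y = det(((a, rhs._1), (c, rhs._2))) / d
      Some((x, y))
    }
  }
}
```

For the system 2x + y = 5 and x + 3y = 10, Linear2x2Sketch.solve(((2.0, 1.0), (1.0, 3.0)), (5.0, 10.0)) returns Some((1.0, 3.0)). A matrix such as ((1.0, 2.0), (2.0, 4.0)) has a zero determinant, so solve returns None for it.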

Setting up the Scala environment in IntelliJ

It is best to edit Scala code in an IDE such as IntelliJ, which provides faster development tools and coding assistance. Code completion and inspection make coding and debugging faster and simpler, ensuring you focus on the end goal of learning...

Gradient descent

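Spark's stochastic gradient descent (SGD) implementation uses a simple distributed sampling of the data examples. With n examples, weights w, and a per-example loss L, the loss part of the optimization problem is $\frac{1}{n}\sum_{i=1}^{n} L(w; x_i, y_i)$, and therefore $\frac{1}{n}\sum_{i=1}^{n} L'_{w,i}$ would be a true sub-gradient, where $L'_{w,i}$ is the sub-gradient of the loss on the i-th example.

Computing this requires access to the full dataset, which is not optimal.

The parameter miniBatchFraction specifies the fraction of the full data to use instead. The average of the gradients over this subset,

$$\frac{1}{|S|}\sum_{i \in S} L'_{w,i}$$

is a stochastic gradient. Here S is the sampled subset, of size $|S| = \text{miniBatchFraction} \cdot n$.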

In the following code, we show how to use stochastic gradient descent on a mini batch to calculate the weights and the loss. The output of this program is a vector of weights and the loss.

import org.apache.spark.SparkContext

object SparkSGD {
  def main(args: Array[String]): Unit = {
    val m = 4
    val n = 200000
    val sc = new SparkContext("local[2]", "")
    val points = sc.parallelize(0 until m, 2).mapPartitionsWithIndex { (idx, iter) =>
      val random...
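
Because the listing above is cut off, the following standalone sketch illustrates the same idea without Spark: in each iteration, sample a subset of the data, average the per-example gradients over that subset, and step against the result. It assumes a squared loss on a linear model, which may differ from the loss used in the original program. The names MiniBatchSgdSketch, Point, and train, and the stepSize, iterations, and seed parameters, are hypothetical; only miniBatchFraction comes from the text.

```scala
import scala.util.Random

// Illustrative sketch only: plain Scala, no Spark. All names are hypothetical.
object MiniBatchSgdSketch {
  // One labelled example: features x and target y.
  final case class Point(x: Array[Double], y: Double)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  // Gradient of the squared loss 0.5 * (w.x - y)^2 with respect to w (assumed loss).
  def gradient(w: Array[Double], p: Point): Array[Double] = {
    val err = dot(w, p.x) - p.y
    p.x.map(_ * err)
  }

  // Average of the per-example gradients over the sampled subset.
  def miniBatchGradient(w: Array[Double], batch: Seq[Point]): Array[Double] = {
    val sum = batch
      .map(gradient(w, _))
      .reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
    sum.map(_ / batch.size)
  }

  def train(data: Vector[Point], miniBatchFraction: Double,
            stepSize: Double, iterations: Int, seed: Long): Array[Double] = {
    val rng = new Random(seed)
    // Subset size is miniBatchFraction * n, with at least one example.
    val batchSize = math.max(1, (miniBatchFraction * data.size).toInt)
    var w = Array.fill(data.head.x.length)(0.0)
    for (_ <- 1 to iterations) {
      // Draw a fresh random subset for every iteration.
      val batch = rng.shuffle(data).take(batchSize)
      val g = miniBatchGradient(w, batch)
      w = w.zip(g).map { case (wi, gi) => wi - stepSize * gi }
    }
    w
  }
}
```

With miniBatchFraction set to 1.0, every iteration uses the whole dataset, so the update is ordinary gradient descent. Smaller fractions make each step cheaper but noisier.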

Prior, likelihood, and posterior

Bayes' theorem states the following:

Posterior = (Prior * Likelihood) / Evidence

This can also be stated as P(A|B) = (P(B|A) * P(A)) / P(B), where P(A|B) is the probability of A given B, also called the posterior. Here P(A) is the prior, P(B|A) is the likelihood, and P(B) is the evidence.

Prior: Probability distribution representing knowledge or uncertainty about a data object before observing it

Posterior: Conditional probability distribution representing which parameters are likely after observing the data object

Likelihood: The probability of the observed data given a specific category or class
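
As a worked illustration of the formula, the sketch below computes a posterior for a two-class case, expanding P(B) with the law of total probability: P(B) = P(B|A) * P(A) + P(B|not A) * (1 - P(A)). The object name BayesSketch, the function posterior, and the spam-filter numbers are made up for this example.

```scala
// Illustrative sketch only: all names and numbers are hypothetical.
object BayesSketch {
  // P(A|B) = P(B|A) * P(A) / P(B), where
  // P(B) = P(B|A) * P(A) + P(B|not A) * (1 - P(A)).
  def posterior(prior: Double,
                likelihood: Double,
                likelihoodGivenNotA: Double): Double = {
    val evidence = likelihood * prior + likelihoodGivenNotA * (1.0 - prior)
    likelihood * prior / evidence
  }
}
```

Suppose 20% of messages are spam (the prior), a given word appears in 60% of spam messages (the likelihood), and in 5% of non-spam messages. Then BayesSketch.posterior(0.2, 0.6, 0.05) is approximately 0.75: seeing the word raises the probability of spam from 20% to about 75%.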

This is represented as follows:

Calculus

Calculus is a mathematical tool for studying how things change. It provides a framework for modeling systems in which there is change, and a way to deduce the predictions of such models.

Differential calculus

At the core of calculus lie derivatives. The derivative is the instantaneous rate of change of a function with respect to one of its variables, and the process of finding a derivative is known as differentiation. Geometrically, the derivative at a point is the slope of the tangent line to the graph of the function at that point, provided that the derivative exists there.
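
The slope interpretation can be checked numerically. The following sketch approximates the derivative with a central difference, f'(x) ≈ (f(x + h) - f(x - h)) / (2h), which is the slope of a short secant line centered on x. The object name DerivativeSketch and the default step h = 1e-5 are choices made for this illustration.

```scala
// Illustrative sketch only: names and the default step size are hypothetical.
object DerivativeSketch {
  // Central-difference approximation of f'(x): the slope of the secant line
  // through (x - h, f(x - h)) and (x + h, f(x + h)).
  def derivative(f: Double => Double, x: Double, h: Double = 1e-5): Double =
    (f(x + h) - f(x - h)) / (2 * h)
}
```

For f(x) = x * x, DerivativeSketch.derivative(x => x * x, 3.0) is approximately 6.0, which matches the exact derivative 2x evaluated at x = 3.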

Differentiation is the inverse of integration. Differentiation has several applications; in physics, for example, the derivative of displacement...

Plotting

In this segment, we will see how to use Breeze to create a simple line plot from the Breeze DenseVector.

Breeze continues most of the functionality of Scalala's plotting facilities, although the API is different. In the following example, we create two vectors x and y with some values, plot a line, and save it to a PNG file:

package linalg.plot

import breeze.linalg._
import breeze.plot._

object BreezePlotSampleOne {
  def main(args: Array[String]): Unit = {
    // Create a figure and take its first subplot
    val f = Figure()
    val p = f.subplot(0)
    // x and y coordinates of the points to plot
    val x = DenseVector(0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)
    val y = DenseVector(1.1, 2.1, 0.5, 1.0, 3.0, 1.1, 0.0, 0.5, 2.5)
    // Draw the line and label the axes
    p += plot(x, y)
    p.xlabel = "x axis"
    p.ylabel = "y axis"
    // Save the figure as a PNG file
    f.saveas("lines-graph.png")
  }
}

The preceding code generates the following Line Plot:

Breeze also supports histograms. This is drawn for various sample sizes 100,000, and...

Summary

In this chapter, you learned the basics of linear algebra that are useful for machine learning, along with basic constructs such as vectors and matrices. You also learned how to use Spark and Breeze to perform basic operations on these constructs. We looked at techniques such as SVD to transform data, and at the importance of function types in linear algebra. Finally, you learned how to plot basic charts using Breeze. In the next chapter, we will give an overview of machine learning systems, their components, and their architecture.
