You're reading from Apache Spark 2.x Machine Learning Cookbook

Product type: Book
Published in: Sep 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781783551606
Edition: 1st Edition
Authors (5):
Mohammed Guller

Author of Big Data Analytics with Spark - http://www.apress.com/9781484209653

Siamak Amirghodsi

Siamak Amirghodsi (Sammy) is interested in building advanced technical teams, executive management, Spark, Hadoop, big data analytics, AI, deep learning nets, TensorFlow, cognitive models, swarm algorithms, real-time streaming systems, quantum computing, financial risk management, trading signal discovery, econometrics, long-term financial cycles, IoT, blockchain, probabilistic graphical models, cryptography, and NLP.

Shuen Mei

Shuen Mei is a big data analytics platform expert with 15+ years of experience in designing, building, and executing large-scale, enterprise-distributed financial systems with mission-critical low-latency requirements. He is certified in Apache Spark and the Cloudera Big Data platform, including Developer, Admin, and HBase. He is also a certified AWS solutions architect with an emphasis on petabyte-range real-time data platform systems.

Meenakshi Rajendran

Meenakshi Rajendran is experienced in the end-to-end delivery of data analytics and data science products for leading financial institutions. Meenakshi holds a master's degree in business administration and is a certified PMP with over 13 years of experience in global software delivery environments. Her areas of research and interest are Apache Spark, cloud, regulatory data governance, machine learning, Cassandra, and managing global data teams at scale.

Broderick Hall

Broderick Hall is a hands-on big data analytics expert and holds a master's degree in computer science, with 20 years of experience in designing and developing complex enterprise-wide software applications with real-time and regulatory requirements at a global scale. He is a deep learning early adopter and is currently working on a large-scale cloud-based data platform with deep learning net augmentation.

Chapter 9. Optimization - Going Down the Hill with Gradient Descent

In this chapter, we will cover:

  • Optimizing a quadratic cost function and finding the minima using just math to gain insight
  • Coding a quadratic cost function optimization using Gradient Descent (GD) from scratch
  • Coding Gradient Descent optimization to solve Linear regression from scratch
  • Normal equations as an alternative to solve Linear regression in Spark 2.0

Introduction


Understanding how optimization works is fundamental for a successful career in machine learning. We picked the Gradient Descent (GD) method for an end-to-end deep dive to demonstrate the inner workings of an optimization technique. We will develop the concept using three recipes that walk the developer from scratch to fully developed code that solves an actual problem with real-world data. The fourth recipe explores an alternative to GD that uses Spark and normal equations (which have limited scalability for big data problems) to solve a regression problem.

Let's get started. How does a machine learn anyway? Does it really learn from its mistakes? What does it mean when the machine finds a solution using optimization?

At a high level, machines learn based on one of the following five techniques:

  • Error-based learning: In this technique, we search the domain space for a combination of parameter values (weights) that minimizes the total error (predicted versus actual) over the training data.
  • Information...

Optimizing a quadratic cost function and finding the minima using just math to gain insight


In this recipe, we will explore the fundamental concept behind mathematical optimization using simple derivatives before introducing Gradient Descent (a first-order derivative method) and L-BFGS, which is a Hessian-free quasi-Newton method.

We will examine a sample quadratic cost/error function and show how to find the minimum or maximum with just math.

We will use both the closed form (vertex formula) and the derivative method (slope) to find the minima, but we will defer to later recipes in this chapter to introduce numerical optimization techniques, such as Gradient Descent and its application to regression.
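As a minimal sketch of the two approaches named above, the following Scala object applies the vertex formula and checks the slope at the result. It assumes the sample quadratic f(x) = 2x^2 - 8x + 9 used in the next recipe; the object and method names (QuadraticMinimum, vertexX, slope) are our own illustrative choices, not the book's code.

```scala
object QuadraticMinimum {
  // General quadratic f(x) = a*x^2 + b*x + c
  def f(a: Double, b: Double, c: Double)(x: Double): Double =
    a * x * x + b * x + c

  // First derivative (slope): f'(x) = 2*a*x + b
  def slope(a: Double, b: Double)(x: Double): Double =
    2 * a * x + b

  // Closed form: setting f'(x) = 0 gives the vertex x = -b / (2a).
  // The vertex is a minimum when a > 0 and a maximum when a < 0.
  def vertexX(a: Double, b: Double): Double = {
    require(a != 0.0, "a must be non-zero for a quadratic")
    -b / (2 * a)
  }

  def main(args: Array[String]): Unit = {
    val (a, b, c) = (2.0, -8.0, 9.0) // assumed sample coefficients
    val x = vertexX(a, b)
    // prints: x* = 2.0, f(x*) = 1.0, slope = 0.0
    println(s"x* = $x, f(x*) = ${f(a, b, c)(x)}, slope = ${slope(a, b)(x)}")
  }
}
```

Because the slope is exactly zero at the vertex, both methods agree on x = 2 for this function.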

How to do it...

  1. Let's assume we have a quadratic cost function and we find its minima:
  2. The cost function in statistical machine learning algorithms acts as a proxy for the level of difficulty, energy spent, or total error as we move around in our search space.
  3. The first thing we do is to graph the function and inspect it visually...

Coding a quadratic cost function optimization using Gradient Descent (GD) from scratch


In this recipe, we will code an iterative optimization technique called gradient descent (GD) to find the minimum of a quadratic function f(x) = 2x^2 - 8x + 9.

The focus here shifts from using math to solve for the minima (setting the first derivative to zero) to an iterative numerical method called Gradient Descent (GD), which starts with a guess and then gets closer to the solution in each iteration, using a cost/error function as the guideline.
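The iteration described above can be sketched in a few lines of plain Scala. This is an illustration under our own assumptions rather than the recipe's code: the starting guess, step size, tolerance, and the names QuadraticGD and minimize are all hypothetical.

```scala
object QuadraticGD {
  // Cost function from the recipe: f(x) = 2x^2 - 8x + 9
  def f(x: Double): Double = 2 * x * x - 8 * x + 9

  // Its first derivative: f'(x) = 4x - 8
  def df(x: Double): Double = 4 * x - 8

  // Repeatedly step against the slope until the update becomes tiny
  // or the iteration budget runs out. Returns (estimate, iterations used).
  def minimize(x0: Double, stepSize: Double,
               tolerance: Double, maxIter: Int): (Double, Int) = {
    var x = x0
    var iter = 0
    var delta = Double.MaxValue
    while (iter < maxIter && delta > tolerance) {
      val next = x - stepSize * df(x) // move downhill along the gradient
      delta = math.abs(next - x)      // size of the last update
      x = next
      iter += 1
    }
    (x, iter)
  }

  def main(args: Array[String]): Unit = {
    // Assumed settings: initial guess 13.0, step size 0.1
    val (xMin, iters) = minimize(13.0, 0.1, 1e-9, 1000)
    println(s"x = $xMin, f(x) = ${f(xMin)}, iterations = $iters")
  }
}
```

With these settings each update shrinks the distance to the minimum by a constant factor, so the estimate approaches x = 2, the same answer the derivative method gives.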

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary JAR files are included.
  2. Set up the path using the package directive: package spark.ml.cookbook.chapter9.
  3. Import the necessary packages.

scala.util.control.Breaks will allow us to break out of the loop. We use this during the debugging phase only, when the program fails to converge or gets stuck in a never-ending process (for example, when the step size is too large).

import...
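To illustrate how scala.util.control.Breaks can guard against a run that fails to converge, here is a small sketch. It is not the recipe's code: the divergence threshold of 1e6, the step sizes, and the names DivergenceGuard and run are assumptions made for the example.

```scala
import scala.util.control.Breaks.{break, breakable}

object DivergenceGuard {
  // Derivative of f(x) = 2x^2 - 8x + 9
  def df(x: Double): Double = 4 * x - 8

  // Runs gradient descent and bails out as soon as x blows up.
  // Returns (last x, true if the guard fired).
  def run(x0: Double, stepSize: Double, maxIter: Int): (Double, Boolean) = {
    var x = x0
    var diverged = false
    breakable {
      for (_ <- 1 to maxIter) {
        x = x - stepSize * df(x)
        // Hypothetical guard: treat NaN or a huge magnitude as divergence
        if (x.isNaN || math.abs(x) > 1e6) {
          diverged = true
          break() // leaves the enclosing breakable block
        }
      }
    }
    (x, diverged)
  }

  def main(args: Array[String]): Unit = {
    println(run(13.0, 0.1, 1000)) // small step: settles near x = 2
    println(run(13.0, 0.6, 1000)) // step too large: the guard fires
  }
}
```

For this particular function the update is x ← (1 - 4α)x + 8α, so any step size α above 0.5 makes the iterates oscillate with growing magnitude, which is the situation the guard detects.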

Coding Gradient Descent optimization to solve Linear Regression from scratch


In this recipe, we will explore how to code Gradient Descent to solve a Linear Regression problem. In the previous recipe, we demonstrated how to code GD to find the minimum of a quadratic function.

This recipe demonstrates a more realistic optimization problem in which we optimize (minimize) the least squares cost function to solve the linear regression problem in Scala on Apache Spark 2.0+. We will use real data, run our algorithm, and compare the result to tier-1 commercially available statistical software to demonstrate accuracy and speed.
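Before turning to the real dataset, the idea of minimizing a least squares cost with GD can be sketched on a tiny synthetic example. Everything here is an assumption made for illustration: the five data points (which lie exactly on y = 1 + 2x), the step size, the iteration count, and the names LinearRegressionGD and fit. It runs in plain Scala without Spark.

```scala
object LinearRegressionGD {
  // Fits y ≈ b0 + b1 * x by batch gradient descent on the cost
  // J(b0, b1) = (1 / 2n) * Σ (b0 + b1 * x_i - y_i)^2.
  // Returns (intercept b0, slope b1).
  def fit(xs: Array[Double], ys: Array[Double],
          stepSize: Double, iterations: Int): (Double, Double) = {
    require(xs.length == ys.length && xs.nonEmpty, "need matching, non-empty data")
    val n = xs.length.toDouble
    var b0 = 0.0
    var b1 = 0.0
    for (_ <- 1 to iterations) {
      // Prediction error for every training point
      val errors = xs.zip(ys).map { case (x, y) => (b0 + b1 * x) - y }
      // Partial derivatives of J with respect to b0 and b1
      val grad0 = errors.sum / n
      val grad1 = errors.zip(xs).map { case (e, x) => e * x }.sum / n
      b0 -= stepSize * grad0
      b1 -= stepSize * grad1
    }
    (b0, b1)
  }

  def main(args: Array[String]): Unit = {
    // Synthetic points on the line y = 1 + 2x (not the recipe's data)
    val xs = Array(0.0, 1.0, 2.0, 3.0, 4.0)
    val ys = Array(1.0, 3.0, 5.0, 7.0, 9.0)
    val (b0, b1) = fit(xs, ys, stepSize = 0.05, iterations = 5000)
    println(s"intercept = $b0, slope = $b1")
  }
}
```

The structure is the same as in the quadratic case: compute the gradient of the cost, step against it, and repeat. The only change is that there are now two parameters to update on each pass.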

How to do it...

  1. We start by downloading the file from Princeton University which contains the following data:

Source: Princeton University

  2. To keep things simple, we then select the yr and sl columns to study how the number of years in rank influences the salary. To cut down on data wrangling code, we save those two columns in...

Normal equations as an alternative for solving Linear Regression in Spark 2.0


In this recipe, we present an alternative to Gradient Descent (GD) and L-BFGS by using Normal Equations to solve linear regression. In the case of normal equations, you set up your regression as a matrix of features and a vector of labels (dependent variables) and solve it by using matrix operations such as inverse, transpose, and so on.

The emphasis here is to highlight Spark's facility for using Normal Equations to solve Linear Regression and not the details of the model or generated coefficients.
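To make the matrix view concrete, here is a sketch of the normal equation θ = (XᵀX)⁻¹Xᵀy for a single feature plus an intercept, written in plain Scala rather than Spark. The synthetic data and the names NormalEquationSimple and fit are assumptions for illustration. In Spark ML, to our knowledge, the corresponding switch is the solver parameter of LinearRegression, which accepts the value "normal".

```scala
object NormalEquationSimple {
  // With a column of ones for the intercept, the normal equation's pieces are
  //   XᵀX = [[n, Σx], [Σx, Σx²]]  and  Xᵀy = [Σy, Σxy].
  // The 2x2 system is solved with the explicit matrix inverse.
  // Returns (intercept b0, slope b1).
  def fit(xs: Array[Double], ys: Array[Double]): (Double, Double) = {
    require(xs.length == ys.length && xs.nonEmpty, "need matching, non-empty data")
    val n   = xs.length.toDouble
    val sx  = xs.sum
    val sxx = xs.map(x => x * x).sum
    val sy  = ys.sum
    val sxy = xs.zip(ys).map { case (x, y) => x * y }.sum

    val det = n * sxx - sx * sx // determinant of XᵀX
    require(det != 0.0, "XᵀX is singular (all x values are identical)")

    val b0 = (sxx * sy - sx * sxy) / det
    val b1 = (n * sxy - sx * sy) / det
    (b0, b1)
  }

  def main(args: Array[String]): Unit = {
    // Synthetic points on the line y = 1 + 2x (not the housing data)
    val xs = Array(0.0, 1.0, 2.0, 3.0, 4.0)
    val ys = Array(1.0, 3.0, 5.0, 7.0, 9.0)
    println(fit(xs, ys))
  }
}
```

No step size or iteration count is needed, because the solution comes from a single linear solve. The trade-off is that forming and inverting XᵀX becomes expensive as the number of features grows.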

How to do it...

  1. We use the same housing dataset that we covered extensively in Chapter 5, Practical Machine Learning with Regression and Classification in Spark 2.0 - Part I and Chapter 6, Practical Machine Learning with Regression and Classification in Spark 2.0 - Part II, which relates various attributes (for example, the number of rooms) to the price of the house.

The data is available as housing8.csv under the...
