Chapter 2. Just Enough Linear Algebra for Machine Learning with Spark

In this chapter, we will cover the following recipes:

  • Package imports and initial setup for vectors and matrices
  • Creating DenseVector and setup with Spark 2.0
  • Creating SparseVector and setup with Spark 2.0
  • Creating DenseMatrix and setup with Spark 2.0
  • Using sparse local matrices with Spark 2.0
  • Performing vector arithmetic using Spark 2.0
  • Performing matrix arithmetic with Spark 2.0
  • Distributed matrices in Spark 2.0 ML library
  • Exploring RowMatrix in Spark 2.0
  • Exploring distributed IndexedRowMatrix in Spark 2.0
  • Exploring distributed CoordinateMatrix in Spark 2.0
  • Exploring distributed BlockMatrix in Spark 2.0

Introduction


Linear algebra is the cornerstone of machine learning (ML) and mathematical programming (MP). When dealing with Spark's machine learning library, one must understand that the Vector/Matrix structures provided by Scala (imported by default) are different from the Vector and Matrix facilities provided by Spark ML and MLlib. The latter, backed by RDDs, are the desired data structures if you want to use Spark (that is, parallelism) out of the box for large-scale matrix/vector computation (for example, SVD implementation alternatives with more numerical accuracy, desired in some cases for derivatives pricing and risk analytics). The Scala Vector/Matrix libraries provide a rich set of linear algebra operations, such as dot product, addition, and so on, that still have their own place in an ML pipeline. In summary, the key difference between using Scala/Breeze and Spark ML or MLlib is that the Spark facilities are backed by RDDs, which allow for simultaneous, distributed, resilient computing...
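As a quick illustration of the difference, here is a minimal sketch (assuming a running SparkSession named spark; the values are illustrative) contrasting a local Breeze operation with the same data wrapped in a distributed Spark matrix:

 import breeze.linalg.{DenseVector => BreezeVector}
 import org.apache.spark.mllib.linalg.Vectors
 import org.apache.spark.mllib.linalg.distributed.RowMatrix

 // Breeze: rich local linear algebra, executed on a single JVM
 val bv1 = BreezeVector(1.0, 2.0, 3.0)
 val bv2 = BreezeVector(4.0, 5.0, 6.0)
 println(bv1 dot bv2)                              // 32.0

 // Spark MLlib: the same rows parallelized as an RDD and wrapped in a
 // distributed matrix, so the computation can scale across a cluster
 val rows = spark.sparkContext.parallelize(Seq(
   Vectors.dense(1.0, 2.0, 3.0),
   Vectors.dense(4.0, 5.0, 6.0)))
 val distributedMatrix = new RowMatrix(rows)
 println((distributedMatrix.numRows(), distributedMatrix.numCols()))  // (2,3)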

Package imports and initial setup for vectors and matrices


Before we can program in Spark or use its vector and matrix artifacts, we must first import the right packages and then set up the SparkSession so we can gain access to the cluster handle.

In this short recipe, we highlight a comprehensive set of packages that covers most of the linear algebra operations in Spark. The individual recipes that follow include the subset required for each specific program.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter2
  3. Import the necessary packages for vector and matrix manipulation:
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.sql.{SparkSession...
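The recipe's remaining step sets up the SparkSession handle; a minimal setup, consistent with the recipes later in this chapter, looks like this:

 val spark = SparkSession
  .builder
  .master("local[*]")                          // local mode, using all available cores
  .appName("myVectorMatrix")
  .config("spark.sql.warehouse.dir", ".")
  .getOrCreate()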

Creating DenseVector and setup with Spark 2.0


In this recipe, we explore DenseVectors using the Spark 2.0 library.

Spark provides two types of vector facilities (dense and sparse) for storing and manipulating feature vectors that are going to be used in learning or optimization algorithms.

How to do it...

  1. In this section, we examine DenseVector examples that you would most likely use when implementing or augmenting existing machine learning programs. These examples also help you better understand Spark ML and MLlib source code and the underlying implementation (for example, Singular Value Decomposition).
  2. Here we look at creating an ML vector feature (with independent variables) from arrays, which is a common use case. In this case, we have three almost fully populated Scala arrays corresponding to customer and product feature sets. We convert these arrays to the corresponding DenseVectors in Scala:
val CustomerFeatures1: Array[Double] = Array(1,3,5,7,9,1,3,2,4,5,6,1,2,5,3,7,4,3,4,1)
 val CustomerFeatures2...
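The conversion itself is one call per array; a minimal sketch (assuming the remaining arrays are populated like CustomerFeatures1 above, and Vectors is imported from org.apache.spark.mllib.linalg):

 val denseVec1 = Vectors.dense(CustomerFeatures1)  // wraps the array in an MLlib DenseVector
 println(denseVec1.size)                           // 20
 println(denseVec1(3))                             // 7.0 -- element access by index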

Creating SparseVector and setup with Spark 2.0


In this recipe, we cover several types of SparseVector creation. As the length of the vector increases (into the millions) and the density remains low (few non-zero members), the sparse representation becomes more and more advantageous over the DenseVector.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Import the necessary packages for vector and matrix manipulation:
import org.apache.spark.sql.{SparkSession}
import org.apache.spark.mllib.linalg._
import breeze.linalg.{DenseVector => BreezeVector}
import Array._
import org.apache.spark.mllib.linalg.SparseVector
  3. Set up the Spark context and application parameters so Spark can run. See the first recipe in this chapter for more details and variations:
val spark = SparkSession
 .builder
 .master("local[*]")
 .appName("myVectorMatrix")
 .config("spark.sql.warehouse.dir", ".")
 .getOrCreate()
  4. Here we look at creating an ML SparseVector that corresponds...
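A minimal sketch of one such creation (the size and values are illustrative): a vector of length 10 with non-zero entries only at indices 0, 3, and 7:

 val sparseVec = Vectors.sparse(10, Array(0, 3, 7), Array(1.0, 5.0, 9.0))
 println(sparseVec(3))        // 5.0
 println(sparseVec(4))        // 0.0 -- unstored cells read back as zero
 println(sparseVec.toDense)   // materializes all 10 slots as a DenseVector

Only the three stored values and their indices are kept in memory, which is why the sparse representation wins as the vector length grows.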

Creating DenseMatrix and setup with Spark 2.0


In this recipe, we explore DenseMatrix creation examples that you will most likely need in your Scala programming and when reading the source code of many open source machine learning libraries.

Spark provides two distinct types of local matrix facilities (dense and sparse) for storing and manipulating data at a local level. For simplicity, one way to think of a DenseMatrix is to visualize it as columns of vectors.

Getting ready

The key point to remember here is that this recipe covers local matrices stored on one machine. We will use another recipe, Distributed matrices in the Spark 2.0 ML library, covered later in this chapter, for storing and manipulating distributed matrices.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Import the necessary packages for vector and matrix manipulation:
 import org.apache.spark.sql.{SparkSession}
 import org.apache.spark.mllib.linalg._
 import breeze...
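A minimal creation sketch (the values are illustrative); note that MLlib stores dense matrices in column-major order:

 val dm = Matrices.dense(3, 2, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
 // column 0 = (1.0, 2.0, 3.0), column 1 = (4.0, 5.0, 6.0)
 println(dm(0, 1))            // 4.0 -- row 0, column 1
 println(dm.transpose)        // a 2 x 3 view of the same data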

Using sparse local matrices with Spark 2.0


In this recipe, we concentrate on sparse local matrix creation. In the previous recipe, we saw how a local dense matrix is declared and stored. A good number of machine learning problem domains can be represented as a set of features and labels within a matrix. In large-scale machine learning problems (for example, the progression of a disease through large population centers, security fraud, political movement modeling, and so on), a good portion of the cells will be 0 or null (for example, the current number of people with a given disease versus the healthy population).

To make storage and real-time operation efficient, sparse local matrices specialize in storing only the non-zero cells as a list of values plus an index, which leads to faster loading and faster real-time operations.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Import the necessary packages for vector and matrix manipulation: 
 import org...
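A minimal sketch of a sparse local matrix in the compressed sparse column (CSC) layout MLlib uses (the values are illustrative); the 3 x 3 matrix below stores only three non-zero cells:

 // non-zeros: (0,0)=1.0, (2,1)=2.0, (1,2)=3.0
 val sm = Matrices.sparse(
   3, 3,
   Array(0, 1, 2, 3),       // colPtrs: column j's values sit at positions colPtrs(j) until colPtrs(j+1)
   Array(0, 2, 1),          // the row index of each stored value
   Array(1.0, 2.0, 3.0))    // the stored values themselves
 println(sm(2, 1))           // 2.0
 println(sm(0, 1))           // 0.0 -- unstored cells read back as zero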

Performing vector arithmetic using Spark 2.0


In this recipe, we explore vector addition in the Spark environment, using the Breeze library for the underlying operations. Vectors allow us to collect features and then manipulate them via linear algebra operations, such as add, subtract, transpose, dot product, and so on.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Import the necessary packages for vector and matrix manipulation:
 import org.apache.spark.mllib.linalg.distributed.RowMatrix
 import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
 import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
 import org.apache.spark.sql.{SparkSession}
 import org.apache.spark.mllib.linalg._
 import breeze.linalg.{DenseVector => BreezeVector}
 import Array._
 import org.apache.spark.mllib.linalg.DenseMatrix
 import org.apache.spark.mllib.linalg.SparseVector
  3. Set up the Spark session...
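A sketch of the addition step (the values are illustrative): the MLlib vectors are converted to Breeze vectors, which carry the arithmetic operators:

 val v1 = Vectors.dense(1.0, 2.0, 3.0)
 val v2 = Vectors.dense(10.0, 20.0, 30.0)
 val bv1 = new BreezeVector(v1.toArray)  // BreezeVector is the alias imported in step 2
 val bv2 = new BreezeVector(v2.toArray)
 println(bv1 + bv2)                      // DenseVector(11.0, 22.0, 33.0)
 println(bv1 dot bv2)                    // 140.0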

Performing matrix arithmetic using Spark 2.0


In this recipe, we explore matrix operations such as addition, transpose, and multiplication in Spark. More complex operations, such as inverse, SVD, and so on, will be covered in later sections. The native sparse and dense matrices in the Spark ML library provide a multiplication operator, so there is no need to convert to Breeze explicitly for that.

Matrices are the workhorses of distributed computing. ML features that are collected can be arranged in a matrix configuration and operated on at scale. Many ML methods, such as ALS (Alternating Least Squares) and SVD (Singular Value Decomposition), rely on efficient matrix and vector operations to achieve large-scale machine learning and training.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Import the necessary packages for vector and matrix manipulation:
 import org.apache.spark.mllib.linalg.distributed.RowMatrix
 import org.apache.spark.mllib...
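A sketch of the operations this recipe covers (the values are illustrative): multiplication and transpose are native to MLlib local matrices, while addition is one of the operations that goes through Breeze:

 val m1 = new DenseMatrix(2, 2, Array(1.0, 2.0, 3.0, 4.0))  // column-major
 val m2 = new DenseMatrix(2, 2, Array(1.0, 0.0, 0.0, 1.0))  // 2 x 2 identity
 println(m1.multiply(m2))    // native multiplication -- equals m1
 println(m1.transpose)       // native transpose

 // addition via Breeze: rebuild each matrix from its column-major array
 val b1 = new breeze.linalg.DenseMatrix(2, 2, m1.toArray)
 val b2 = new breeze.linalg.DenseMatrix(2, 2, m2.toArray)
 println(b1 + b2)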

Exploring RowMatrix in Spark 2.0


In this recipe, we explore the RowMatrix facility that is provided by Spark. RowMatrix, as the name implies, is a row-oriented matrix, the catch being the lack of an index that can be defined and carried through the computational life cycle of a RowMatrix. The rows are backed by RDDs, which provide distributed computing and resiliency with fault tolerance.

The matrix is made of rows of local vectors that are parallelized and distributed via RDDs. In short, the rows are distributed as an RDD, but the total number of columns is limited by the maximum size of a local vector. This is not an issue in most cases, but we felt we should mention it for completeness.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Import the necessary packages for vector and matrix manipulation:
 import org.apache.spark.mllib.linalg.distributed.RowMatrix
 import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix...
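A minimal RowMatrix sketch (the data is illustrative; assumes the SparkSession spark from the earlier setup): the rows are local vectors parallelized through an RDD:

 val rows = spark.sparkContext.parallelize(Seq(
   Vectors.dense(1.0, 2.0, 3.0),
   Vectors.dense(4.0, 5.0, 6.0),
   Vectors.dense(7.0, 8.0, 9.0)))
 val rowMat = new RowMatrix(rows)
 println(rowMat.numRows())   // 3
 println(rowMat.numCols())   // 3 -- bounded by the maximum size of a local vector
 // note: a RowMatrix carries no row indices, so row order is not meaningful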

Exploring distributed IndexedRowMatrix in Spark 2.0


In this recipe, we cover the IndexedRowMatrix, which is the first distributed matrix that we cover in this chapter. The primary advantage of the IndexedRowMatrix is that the index can be carried along with the row (the RDD), which is the data itself.

In the case of the IndexedRowMatrix, we have an index assigned by the developer that is permanently paired with a given row, which is very useful for random-access cases. The index not only helps with random access, but is also used to identify the row itself when performing join() operations.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Import the necessary packages for vector and matrix manipulation:
 import org.apache.spark.mllib.linalg.distributed.RowMatrix
 import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
 import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry...
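A minimal sketch (the data is illustrative): each IndexedRow pairs a developer-assigned Long index with its vector:

 val indexedRows = spark.sparkContext.parallelize(Seq(
   IndexedRow(0L, Vectors.dense(1.0, 2.0)),
   IndexedRow(1L, Vectors.dense(3.0, 4.0)),
   IndexedRow(4L, Vectors.dense(5.0, 6.0))))  // indices need not be contiguous
 val indexedMat = new IndexedRowMatrix(indexedRows)
 println(indexedMat.numRows())               // 5 -- the largest index + 1
 val plainRows = indexedMat.toRowMatrix()    // drop the indices when no longer needed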

Exploring distributed CoordinateMatrix in Spark 2.0


In this recipe, we cover the second form of distributed matrix, the CoordinateMatrix. It is very useful when dealing with ML implementations that need to handle often-large 3D coordinate systems (x, y, z). It is a convenient way to package a coordinate data structure into a distributed matrix.

How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
  2. Import the necessary packages for vector and matrix manipulation:
 import org.apache.spark.mllib.linalg.distributed.RowMatrix
 import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
 import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
 import org.apache.spark.sql.{SparkSession}
 import org.apache.spark.mllib.linalg._
 import breeze.linalg.{DenseVector => BreezeVector}
 import Array._
 import org.apache.spark.mllib.linalg.DenseMatrix
 import org.apache.spark.mllib.linalg.SparseVector
  3. Set up...
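A minimal sketch (the data is illustrative): each MatrixEntry is an (i, j, value) triple, which suits very large, very sparse coordinate data:

 val entries = spark.sparkContext.parallelize(Seq(
   MatrixEntry(0L, 0L, 1.0),
   MatrixEntry(1L, 2L, 2.0),
   MatrixEntry(3L, 1L, 3.0)))
 val coordMat = new CoordinateMatrix(entries)
 println((coordMat.numRows(), coordMat.numCols()))  // (4,3) -- driven by the largest indices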

Exploring distributed BlockMatrix in Spark 2.0


In this recipe, we explore the BlockMatrix, which is a nice abstraction that acts as a placeholder for blocks of other matrices. In short, it is a matrix of other matrices (matrix blocks), each of which can be accessed as a cell.

We take a look at a simplified code snippet, converting a CoordinateMatrix to a BlockMatrix, then doing a quick check of its validity and accessing one of its properties to show that it was set up properly. BlockMatrix code takes longer to set up, and it would need a real-life application (for which there is not enough space here) to demonstrate its properties in action.

How to do it...

  1. Start a new project in IntelliJ or in an editor of your choice and make sure all the necessary JAR files (Scala and Spark) are available to your application.
  2. Import the necessary packages for vector and matrix manipulation:
import org.apache.spark.mllib.linalg.distributed.RowMatrix
 import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
 import org.apache.spark...
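A sketch of the conversion and validity check described above, reusing a CoordinateMatrix (coordMat) built as in the previous recipe:

 val blockMat = coordMat.toBlockMatrix().cache()  // defaults to 1024 x 1024 blocks
 blockMat.validate()                              // throws an exception if the block structure is malformed
 println((blockMat.numRowBlocks, blockMat.numColBlocks))
 println(blockMat.toLocalMatrix())                // collects to the driver -- small matrices only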