Chapter 8. New Paradigm in Mahout

Mahout started out primarily as a Java MapReduce package for running distributed and scalable machine learning algorithms on top of Hadoop. As the Mahout project has matured, it has decided to move away from MapReduce and embrace Apache Spark and other distributed processing frameworks, such as H2O, with a focus on writing code once and running it on multiple platforms. In this chapter, we are going to discuss:

  • Limitations of MapReduce

  • Apache Spark

  • In-core binding

  • Out-of-core binding

MapReduce and HDFS were the two paradigms largely responsible for a quantum shift in data processing capability. With increased capability, we learned to imagine larger problems, which kick-started a whole new industry of Big Data Analytics. The last decade has been remarkable for solving data-related problems. In recent times, however, a lot of effort has been put into developing processing paradigms beyond MapReduce. These efforts are aimed either at replacing MapReduce or at augmenting the processing framework...

Moving beyond MapReduce


Let's discuss why we need to move beyond MapReduce. Depending on the scenario and use case, MapReduce has both advantages and limitations. In this section, we will concern ourselves with the limitations that affect machine learning use cases.

Firstly, MapReduce is not feasible when intermediate processes need to talk to each other. Many machine learning algorithms need to work with a shared global state, which is difficult to implement with MapReduce.

Secondly, quite a few problems are difficult to break down into map and reduce phases. Mahout is porting to Apache Spark, which works on top of HDFS and provides a processing paradigm other than MapReduce.

Apache Spark


Spark was developed as a general-purpose engine for large-scale data processing. It recently released its 1.0 version. Spark has two important features.

The first important feature of Spark is the resilient distributed dataset (RDD). This is a collection of elements partitioned across the nodes of a cluster that can be operated on in parallel. A file on HDFS or any existing Scala collection can be converted to an RDD, and any operation on it can be executed in parallel. RDDs can also be asked to persist in memory, which makes repeated parallel operations efficient. RDDs have automatic failover support and can recover from node failures.
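
As a brief, hedged illustration (the HDFS path below is a placeholder, not from the book), an RDD can be created from a file, cached in memory, and then reused by several parallel operations in the Spark shell:

// In the Spark shell, sc is the pre-built SparkContext.
// The HDFS path below is a placeholder.
val lines = sc.textFile("hdfs:///data/input.txt").cache()

// Two parallel operations reuse the same in-memory RDD.
val lineCount = lines.count()
val wordCount = lines.flatMap(_.split("\\s+")).count()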

The second important feature of Spark is the concept of shared variables that can be used in any parallel operations. Spark supports two types of shared variables: broadcast variables and accumulators. Broadcast variables can be used to cache a value in memory on all the nodes, whereas accumulators are variables that can only be added up...
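
To make the two kinds of shared variables concrete, here is a small, hedged sketch using the standard Spark 1.x shell API (the variable names are illustrative, not from the book):

// Broadcast variable: a read-only value cached on every node.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Accumulator: tasks may only add to it; the driver reads the final value.
val misses = sc.accumulator(0)

val data = sc.parallelize(Seq("a", "b", "c", "a"))
val mapped = data.map { key =>
  if (!lookup.value.contains(key)) misses += 1
  lookup.value.getOrElse(key, 0)
}
mapped.collect()
println("keys not found: " + misses.value)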

In-core types


Vectors and matrices are in-core (in-memory) types. We will try out some basic commands to get a feel for the linear algebra operations that are possible.

Vector

We will first discuss vectors and then cover matrices. We will see some examples of operations that can be performed on vectors.

Initializing a vector inline

Dense vector: A dense vector is a vector with relatively few zero elements. On the Mahout command line, type the following command to initialize a dense vector:

mahout>val denseVec1: Vector = (1.0, 1.1, 1.2)

In the output, each element is prefixed by its index, which starts at 0. The output of the executed command is as follows:

denseVec1: org.apache.mahout.math.Vector = {0:1.0,1:1.1,2:1.2}

Sparse vector: A sparse vector is a vector with a relatively large number of zero elements. On the Mahout command line, type the following command to initialize a sparse vector:

mahout>val sparseVec = svec((5 -> 1) :: (10 -> 2.0) :: Nil)

The output of the command...
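
To get a feel for what can be done with these vectors, here is a small, hedged sketch of element-wise operations using the R-like Scala DSL (the exact operators are assumed from the Mahout Scala bindings):

mahout>val v1: Vector = (1.0, 2.0, 3.0)
mahout>val v2 = dvec(3.0, 2.0, 1.0)
mahout>v1 + v2        // element-wise addition
mahout>v1 * 2         // multiply each element by a scalar
mahout>v1 dot v2      // dot product of the two vectors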

Spark Mahout basics


We will now focus on Mahout's distributed row matrix (DRM) on Spark. Once loaded into Spark, a DRM is partitioned by its rows.

Initializing the Spark context

Many operations on a DRM require a Spark context. To initialize Mahout with a Spark session, we create the implicit variable mahoutCtx as the Spark context:

implicit val mahoutCtx = mahoutSparkContext(
  masterUrl = "spark://ctiwary-gsu-hyd:7077",
  appName = "MahoutLocalContext"
)

We will then import the required packages:

// Import matrix, vector types, etc.
import org.apache.mahout.math._
// Import scala bindings operations
import scalabindings._
// Enable R-like dialect in scala bindings
import RLikeOps._
// Import distributed matrix apis
import drm._
// Import R-like distributed dialect
import RLikeDrmOps._
// Those are needed for Spark-specific
// operations such as context creation.
// 100% engine-agnostic code does not
// require these.
import org.apache.mahout.sparkbindings._
// A good idea when working with mixed
// scala/java iterators...

Linear regression with Mahout Spark


We will discuss the linear regression example from the Mahout wiki. Let's first create the training data in the form of a parallelized DRM:

val drmData = drmParallelize(dense(
  (2, 2, 10.5, 10, 29.509541),  // Apple Cinnamon Cheerios
  (1, 2, 12,   12, 18.042851),  // Cap'n'Crunch
  (1, 1, 12,   13, 22.736446),  // Cocoa Puffs
  (2, 1, 11,   13, 32.207582),  // Froot Loops
  (1, 2, 12,   11, 21.871292),  // Honey Graham Ohs
  (2, 1, 16,   8,  36.187559),  // Wheaties Honey Gold
  (6, 2, 17,   1,  50.764999),  // Cheerios
  (3, 2, 13,   7,  40.400208),  // Clusters
  (3, 3, 13,   4,  45.811716)), // Great Grains Pecan
  numPartitions = 2);

The first four columns will be our feature matrix and the last column will be our target variable. We will separate out the feature matrix and the target vector, with drmX being the feature matrix and y being the target vector:

val drmX = drmData(::, 0 until 4)

The target variable is collected into the memory using the...
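
For reference, here is a hedged sketch of how the Mahout wiki example typically continues, collecting the target column in-core and solving ordinary least squares via the normal equations (these exact calls are assumptions based on that walkthrough, not text from this chapter):

// Collect the last column (the target) into an in-core vector.
val y = drmData.collect(::, 4)

// Ordinary least squares via the normal equations: beta = (X'X)^-1 (X'y).
val drmXtX = drmX.t %*% drmX    // distributed computation of X'X
val drmXty = drmX.t %*% y       // distributed computation of X'y
val XtX = drmXtX.collect        // the small results fit in memory
val Xty = drmXty.collect(::, 0)
val beta = solve(XtX, Xty)      // coefficients of the linear model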

Summary


We briefly discussed Mahout and its Spark bindings. This is the future of Mahout, though a production-ready release is still some time away. We learned the basic operations that can be performed on the various data structures and went through an example of applying these techniques to build a machine learning algorithm. I would encourage you to keep yourself updated on the development of the Mahout Spark bindings; the best way is to follow the Mahout wiki.

In the next chapter, we will discuss end-to-end practical use cases of customer analytics. Most of the techniques used so far will be put into practice, and you will get an idea of a real-life analytics project.
