Chapter 8. New Paradigm in Mahout

Mahout started out primarily as a Java MapReduce package for running distributed and scalable machine learning algorithms on top of Hadoop. As the Mahout project has matured, it has decided to move away from MapReduce and embrace Apache Spark and other distributed processing frameworks, such as H2O, with a focus on writing code once and running it on multiple platforms. In this chapter, we are going to discuss:

  • Limitations of MapReduce

  • Apache Spark

  • In-core binding

  • Out-of-core binding

MapReduce and HDFS were the two paradigms largely responsible for a quantum shift in data processing capability. With increased capability, we learned to imagine larger problems, which kick-started a whole new industry of Big Data Analytics. The last decade has been remarkable for solving data-related problems. In recent times, however, a lot of effort has been put into developing processing paradigms beyond MapReduce. These efforts are aimed either at replacing MapReduce or at augmenting the processing framework...

Moving beyond MapReduce


Let's discuss why we need to move beyond MapReduce. Depending on the scenario and use case, MapReduce has both advantages and limitations. In this section, we will concern ourselves with the limitations that affect machine learning use cases.

Firstly, MapReduce is not feasible when intermediate processes need to talk to each other. Many machine learning algorithms need to work with a shared global state, which is difficult to implement with MapReduce.

Secondly, quite a few problems are difficult to break down into map and reduce phases. Mahout is porting to Apache Spark, which works on top of HDFS and provides a processing paradigm other than MapReduce.

Apache Spark


Spark was developed as a general-purpose engine for large-scale data processing. It recently released its 1.0 version. Spark has two important features.

The first important feature of Spark is the resilient distributed dataset (RDD). This is a collection of elements partitioned across the nodes of a cluster that can be operated on in parallel. A file on HDFS or any existing Scala collection can be converted to an RDD, and any operation on it can be executed in parallel. RDDs can also be asked to persist in memory, which makes repeated parallel operations efficient. RDDs have automatic failover support and can recover from node failures.
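
As a brief, hedged illustration (the HDFS path below is a placeholder, not from the book), an RDD can be created from a file, cached in memory, and then reused by several parallel operations in the Spark shell:

// In the Spark shell, sc is the pre-built SparkContext.
// The HDFS path below is a placeholder.
val lines = sc.textFile("hdfs:///data/input.txt").cache()

// Two parallel operations reuse the same in-memory RDD.
val lineCount = lines.count()
val wordCount = lines.flatMap(_.split("\\s+")).count()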

The second important feature of Spark is the concept of shared variables that can be used in any parallel operations. Spark supports two types of shared variables: broadcast variables and accumulators. Broadcast variables can be used to cache a value in memory on all the nodes, whereas accumulators are variables that can only be added up...
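
To make the two kinds of shared variables concrete, here is a small, hedged sketch using the standard Spark 1.x shell API (the variable names are illustrative, not from the book):

// Broadcast variable: a read-only value cached on every node.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Accumulator: tasks may only add to it; the driver reads the final value.
val misses = sc.accumulator(0)

val data = sc.parallelize(Seq("a", "b", "c", "a"))
val mapped = data.map { key =>
  if (!lookup.value.contains(key)) misses += 1
  lookup.value.getOrElse(key, 0)
}
mapped.collect()
println("keys not found: " + misses.value)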

In-core types


Vectors and matrices are in-core (in-memory) types. We will try out some basic commands to get a feel for the linear algebra operations that are possible.

Vector

We will first discuss vectors and then cover matrices. We will see some examples of operations that can be performed on vectors.

Initializing a vector inline

Dense vector: A dense vector is a vector with relatively few zero elements. On the Mahout command line, type the following command to initialize a dense vector:

mahout>val denseVec1: Vector = (1.0, 1.1, 1.2)

In the output, each element is prefixed by its index, which starts at 0. The output of the executed command is as follows:

denseVec1: org.apache.mahout.math.Vector = {0:1.0,1:1.1,2:1.2}

Sparse vector: A sparse vector is a vector with a relatively large number of zero elements. On the Mahout command line, type the following command to initialize a sparse vector:

mahout>val sparseVec = svec((5 -> 1) :: (10 -> 2.0) :: Nil)

The output of the command...
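
To get a feel for what can be done with these vectors, here is a small, hedged sketch of element-wise operations using the R-like Scala DSL (the exact operators are assumed from the Mahout Scala bindings):

mahout>val v1: Vector = (1.0, 2.0, 3.0)
mahout>val v2 = dvec(3.0, 2.0, 1.0)
mahout>v1 + v2        // element-wise addition
mahout>v1 * 2         // multiply each element by a scalar
mahout>v1 dot v2      // dot product of the two vectors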

Spark Mahout basics


We will now focus on Mahout's distributed row matrix (DRM) on Spark. Once loaded into Spark, a DRM is partitioned by its rows.

Initializing the Spark context

Many operations on a DRM require a Spark context. To initialize Mahout with a Spark session, we create the implicit variable mahoutCtx as the Spark context:

implicit val mahoutCtx = mahoutSparkContext(
  masterUrl = "spark://ctiwary-gsu-hyd:7077",
  appName = "MahoutLocalContext"
)

We will then import the required packages:

// Import matrix, vector types, etc.
import org.apache.mahout.math._
// Import scala bindings operations
import scalabindings._
// Enable R-like dialect in scala bindings
import RLikeOps._
// Import distributed matrix apis
import drm._
// Import R-like distributed dialect
import RLikeDrmOps._
// Those are needed for Spark-specific
// operations such as context creation.
// 100% engine-agnostic code does not
// require these.
import org.apache.mahout.sparkbindings._
// A good idea when working with mixed
// scala/java iterators...

Linear regression with Mahout Spark


We will discuss the linear regression example from the Mahout wiki. Let's first create the training data in the form of a parallelized DRM:

val drmData = drmParallelize(dense(
  (2, 2, 10.5, 10, 29.509541),  // Apple Cinnamon Cheerios
  (1, 2, 12,   12, 18.042851),  // Cap'n'Crunch
  (1, 1, 12,   13, 22.736446),  // Cocoa Puffs
  (2, 1, 11,   13, 32.207582),  // Froot Loops
  (1, 2, 12,   11, 21.871292),  // Honey Graham Ohs
  (2, 1, 16,   8,  36.187559),  // Wheaties Honey Gold
  (6, 2, 17,   1,  50.764999),  // Cheerios
  (3, 2, 13,   7,  40.400208),  // Clusters
  (3, 3, 13,   4,  45.811716)), // Great Grains Pecan
  numPartitions = 2);

The first four columns will be our feature matrix and the last column will be our target variable. We will separate out the feature matrix and the target vector, with drmX being the feature matrix and y being the target vector:

val drmX = drmData(::, 0 until 4)

The target variable is collected into the memory using the...
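
For reference, here is a hedged sketch of how the Mahout wiki example typically continues, collecting the target column in-core and solving ordinary least squares via the normal equations (these exact calls are assumptions based on that walkthrough, not text from this chapter):

// Collect the last column (the target) into an in-core vector.
val y = drmData.collect(::, 4)

// Ordinary least squares via the normal equations: beta = (X'X)^-1 (X'y).
val drmXtX = drmX.t %*% drmX    // distributed computation of X'X
val drmXty = drmX.t %*% y       // distributed computation of X'y
val XtX = drmXtX.collect        // the small results fit in memory
val Xty = drmXty.collect(::, 0)
val beta = solve(XtX, Xty)      // coefficients of the linear model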

Summary


We briefly discussed Mahout and its Spark bindings. This is the future of Mahout, though a production-ready release is still some time away. We learned the basic operations that can be performed on the various data structures and went through an example of applying these techniques to build a machine learning algorithm. I would encourage you to keep yourself updated on the development of the Mahout Spark bindings; the best way is to follow the Mahout wiki.

In the next chapter, we will discuss end-to-end practical use cases of customer analytics. Most of the techniques used so far will be put into practice, and you will get an idea of a real-life analytics project.
