Chapter 8. Machine Learning and Predictive Analytics Using Mahout and R

In this chapter, we'll cover the following recipes:

  • Setting up the Mahout development environment

  • Creating an item-based recommendation engine using Mahout

  • Creating a user-based recommendation engine using Mahout

  • Using predictive analytics for the marketing data of a bank

  • Clustering text data using K-Means

  • Performing population data analytics using R

  • Performing Twitter sentiment analytics using R

  • Performing predictive analytics using R

Introduction


In the previous chapter, we talked about how to automate Hadoop and its ecosystem tasks using Oozie. In this chapter, we will go deeper into machine learning using Mahout and R. Mahout is a machine learning library that lets us solve machine learning problems with relative ease, while R is a statistical tool that helps us build and analyze models. So, let's get started.

Setting up the Mahout development environment


In this recipe, we are going to take a look at how to set up the Mahout development environment.

Getting ready

To perform this recipe, you should have a running Hadoop cluster.

How to do it...

Setting up the Mahout environment is very easy:

  1. To start with, we need to download the latest version of Mahout from http://www.apache.org/dyn/closer.cgi/mahout/.

  2. I am going to use version 0.11.1, which can be found at http://www.eu.apache.org/dist/mahout/0.11.1/apache-mahout-distribution-0.11.1.tar.gz.

  3. Next, extract the tarball and rename the folder to mahout for simplicity's sake (these commands assume you are working in /usr/local, which matches the MAHOUT_HOME path set in the next step):

    sudo tar -xzf apache-mahout-distribution-0.11.1.tar.gz
    sudo mv apache-mahout-distribution-0.11.1 mahout
    
  4. To use the Mahout commands from anywhere, we add the distribution's bin folder to PATH.

    Edit ~/.bashrc and add the following commands to it:

    export MAHOUT_HOME=/usr/local/mahout
    export PATH=$PATH:$MAHOUT_HOME/bin
    
  5. Execute the following command to check whether the changes have taken effect...
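
    One quick check, for instance, is to reload the profile and run the mahout launcher with no arguments; it should print the list of available Mahout programs, confirming that the PATH change is in effect:

    source ~/.bashrc
    mahout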

Creating an item-based recommendation engine using Mahout


In this recipe, we are going to take a look at how to use Mahout to generate item-based recommendations. Recommendation engines are among the most common machine learning use cases: a recommendation engine generates suggestions based on the input data provided to it. Here, we will generate recommendations based on user preferences for certain items.

Getting ready

To perform this recipe, you should have a running Hadoop cluster as well as the latest version of Mahout installed on it.

How to do it...

Mahout provides built-in support for item-based recommendations. In order to execute a program using Mahout, we first need to prepare the input data and store it in a certain folder. The input data needs to be in a specific format (userId, itemId, preference), where userId is the unique user identifier, itemId is the unique item identifier, and preference is a rating given by a user to a specific item...
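
As a sketch of the end-to-end flow (the file name, HDFS paths, and similarity measure here are illustrative assumptions, not the book's exact values), the input is a CSV of such triples, which we copy to HDFS and feed to Mahout's recommenditembased driver:

# ratings.csv -- one userId,itemId,preference triple per line, for example:
#   1,101,5.0
#   1,102,3.0
#   2,101,2.0
#   2,103,4.5
hadoop fs -mkdir /recommend
hadoop fs -put ratings.csv /recommend/input

# Run the distributed item-based recommender job
mahout recommenditembased \
    --input /recommend/input \
    --output /recommend/output \
    --similarityClassname SIMILARITY_COOCCURRENCE \
    --numRecommendations 5

The job writes, for each user, the top five items they are most likely to prefer, along with estimated preference scores.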

Creating a user-based recommendation engine using Mahout


In this recipe, we are going to take a look at how to use Mahout to generate user-based recommendations. The user-based recommendation engine is not directly available as a MapReduce job; we have to run it in a sequential manner, as described in the next section.

Getting ready

To perform this recipe, you should have a running Hadoop cluster as well as the latest version of Mahout installed on it. We will also need an IDE, such as Eclipse, for code development.

How to do it...

User-based recommendations work on the simple principle that similar users tend to like the same set of items. To implement one, we first need to create a Maven project and add the following dependency to it (the version should match the Mahout release installed earlier):

<dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-mr</artifactId>
    <version>0.11.1</version>
</dependency>

Next, we create a...
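
A sequential user-based recommender built on Mahout's Taste API typically looks like the following sketch (the data file ratings.csv, the neighborhood size of 10, and the choice of Pearson correlation are illustrative assumptions, not necessarily the book's exact code):

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserBasedRecommenderExample {
    public static void main(String[] args) throws Exception {
        // Load userId,itemId,preference triples from a local CSV file
        DataModel model = new FileDataModel(new File("ratings.csv"));
        // Compute user-user similarity with Pearson correlation
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Treat the 10 most similar users as the neighborhood
        UserNeighborhood neighborhood =
                new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Top 3 recommendations for user 1
        List<RecommendedItem> recommendations = recommender.recommend(1, 3);
        for (RecommendedItem item : recommendations) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}

Because everything runs in a single JVM against a local file, this is the sequential mode mentioned earlier; no MapReduce job is submitted.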

Using predictive analytics for the marketing data of a bank


In this recipe, we are going to take a look at how to use Mahout to generate a predictive model and validate how good this model is against some sample data. Here, we will be using the sample data collected by a bank during their marketing operations.

Getting ready

To perform this recipe, you should have a running Hadoop cluster as well as the latest version of Mahout installed on it.

How to do it...

In this recipe, we are going to use Logistic Regression in order to predict the occurrence of an event. It uses predictors from the given data in order to calculate the probability. The Mahout implementation uses the Stochastic Gradient Descent (SGD) algorithm for logistic regression. You can learn more about SGD for logistic regression at http://blog.trifork.com/2014/02/04/an-introduction-to-mahouts-logistic-regression-sgd-classifier/.

SGD is, by default, a sequential algorithm, so we cannot run any parallel activities on it. Even though it is...
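
To give a flavor of the command-line workflow (the file name and predictor columns below are assumptions based on the publicly available bank marketing dataset, not necessarily the book's exact invocation), Mahout's trainlogistic driver trains an SGD model and runlogistic evaluates it:

# Train: predict the binary target y from three numeric predictors
mahout trainlogistic \
    --input bank.csv \
    --output bank.model \
    --target y \
    --categories 2 \
    --predictors age balance duration \
    --types numeric \
    --features 20 \
    --passes 100 \
    --rate 50

# Evaluate: report the area under the ROC curve and a confusion matrix
mahout runlogistic \
    --input bank.csv \
    --model bank.model \
    --auc \
    --confusion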

Clustering text data using K-Means


In this recipe, we are going to take a look at how to cluster text data using Mahout's implementation of the K-Means algorithm. K-Means is a very popular clustering algorithm; you can read more about it at https://en.wikipedia.org/wiki/K-means_clustering.

Getting ready

To perform this recipe, you should have a running Hadoop cluster as well as the latest version of Mahout installed on it.

How to do it...

In this recipe, we are going to use Mahout's K-Means implementation to cluster the available text data. To do this, we first need to get some text data and copy it to HDFS:

hadoop fs -mkdir /kmeans
hadoop fs -put mydata.txt /kmeans/input

In order to execute the K-Means job on the given data, we first need to convert it into sequence files, and then convert these sequence files into TF-IDF vectors. Mahout provides built-in utilities to perform these actions. The following are the commands to do this.

To convert the text data into sequence files, here is...
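
Put together, the pipeline usually looks like the following (the HDFS paths, the number of clusters, and the distance measure are illustrative choices):

# Convert the raw text under /kmeans/input into Hadoop SequenceFiles
mahout seqdirectory -i /kmeans/input -o /kmeans/seq

# Turn the SequenceFiles into sparse TF-IDF vectors
mahout seq2sparse -i /kmeans/seq -o /kmeans/vectors

# Cluster the TF-IDF vectors into 5 clusters using cosine distance,
# running at most 10 iterations and labeling the points afterwards (-cl)
mahout kmeans \
    -i /kmeans/vectors/tfidf-vectors \
    -c /kmeans/centroids \
    -o /kmeans/clusters \
    -k 5 \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -x 10 -ow -cl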

Performing population data analytics using R


So far, we have talked about how to use Mahout to solve various machine learning problems. Now, we are going to introduce another tool, the R language, which has built-in support for various mathematical and statistical operations.

Getting ready

To perform this recipe, you should have R installed on your machine. You can download the installer from https://cran.r-project.org/bin/windows/base/ (this link points to the Windows build; installers for other platforms are available on CRAN as well).

How to do it...

In this recipe, we are going to learn some basic operations that one can perform using R. To start with, we will use a dataset that contains information about Australia's population in its various states and territories. This is what the dataset looks like:

Year NSW Vic. Qld SA WA Tas. NT ACT Aust.
1917 1904 1409 683 440 306 193 5 3 4941
1927 2402 1727 873 565 392 211 4 8 6182
1937 2693 1853 993 589 457 233 6 11 6836
1947 2985 2055 1106 646 502 257 11 17 7579
1957 3625 2656 1413 873 688 326 21 38 9640
1967 4295 3274 1700 1110 879 375 62 103 11799
1977 5002 3837 2130 1286...
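
With such a file on disk, a few basic operations look like the following sketch (the file name is hypothetical, and the figures appear to be in thousands):

># Load the whitespace-separated table; header = TRUE reads the first row as column names
>population <- read.table("aus_population.txt", header = TRUE)
># Summary statistics (min, max, mean, quartiles) for every column
>summary(population)
># National population recorded in the most recent year of the table
>tail(population$Aust., 1)
># Plot national population growth over time
>plot(population$Year, population$Aust., type = "b",
+     xlab = "Year", ylab = "Population (thousands)")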

Performing Twitter sentiment analytics using R


In an earlier chapter, we saw how to perform Twitter sentiment analytics using Hive and Hadoop. In this recipe, we are going to take a look at how to do this using R.

Getting ready

To perform this recipe, you should have R installed on your machine. You should also have a Twitter account and a Twitter application with an API key, API secret, access token, and access secret so that you can receive tweets in real time.

How to do it...

To get started, we first need to install certain R packages that will be required in this recipe. The following are the commands:

>install.packages("twitteR")
>install.packages("plyr")
>install.packages("stringr")
>install.packages(c("devtools", "rjson", "bit64", "httr"))

Once the installation is complete, load the following packages:

>library(devtools)
>library(twitteR)

Next, we need to supply the keys that Twitter generated on its application page, as follows:

>api_key <...
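
A minimal sketch of this step, plus a first pull of tweets with the twitteR package, looks as follows (the credential strings are placeholders, and the search term is an arbitrary example):

>api_key <- "YOUR_API_KEY"
>api_secret <- "YOUR_API_SECRET"
>access_token <- "YOUR_ACCESS_TOKEN"
>access_secret <- "YOUR_ACCESS_SECRET"
># Authenticate this R session against the Twitter API
>setup_twitter_oauth(api_key, api_secret, access_token, access_secret)
># Fetch 100 recent tweets mentioning a term and extract their text
>tweets <- searchTwitter("hadoop", n = 100)
>tweet_text <- sapply(tweets, function(t) t$getText())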

Performing predictive analytics using R


In the previous recipe, we talked about how to perform sentiment analytics using R. In this recipe, we are going to take a look at how to perform predictive analytics using R. Here, we will be using the Iris flower dataset in order to predict a flower's species based on its features. You can learn more about this at https://en.wikipedia.org/wiki/Iris_flower_data_set.

Getting ready

To perform this recipe, you should have R installed on your machine.

How to do it...

To get started, we need to install an R package called e1071:

>install.packages("e1071")

This package provides the Naive Bayes implementation we need; the iris dataset itself ships with base R. So, we load the library and then load the data:

>library(e1071)
>data(iris)

You can check whether the data has loaded properly by executing the following command:

>iris

In this example, we are going to use the Naive Bayes algorithm to classify the data into species. So, now we have to train the model using Naive Bayes, as shown here...
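
A typical train-and-evaluate flow with e1071's naiveBayes looks like the following sketch (the 80/20 split and the fixed random seed are illustrative choices):

># Hold out a random 20% of the rows for testing
>set.seed(42)
>train_idx <- sample(seq_len(nrow(iris)), size = 0.8 * nrow(iris))
>train <- iris[train_idx, ]
>test <- iris[-train_idx, ]
># Train Naive Bayes: Species as a function of all four measurements
>model <- naiveBayes(Species ~ ., data = train)
># Predict the held-out rows and cross-tabulate against the true species
>predictions <- predict(model, test)
>table(predictions, test$Species)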


About the author

Tanmay Deshpande

Tanmay Deshpande is a Hadoop and big data evangelist. He currently works with Schlumberger as a Big Data Architect in Pune, India. He is interested in a wide range of technologies, such as Hadoop, Hive, Pig, NoSQL databases, Mahout, Sqoop, Java, and cloud computing. He has vast experience in application development in various domains, such as oil and gas, finance, telecom, manufacturing, security, and retail. He enjoys solving machine-learning problems and spends his time reading anything that he can get his hands on. He has a great interest in open source technologies and has been promoting them through his talks. Before Schlumberger, he worked with Symantec, Lumiata, and Infosys. Through his innovative thinking and dynamic leadership, he has successfully completed various projects. He regularly blogs on his website http://hadooptutorials.co.in. You can connect with him on LinkedIn at https://www.linkedin.com/in/deshpandetanmay/. He has also authored Mastering DynamoDB (August 2014), DynamoDB Cookbook (September 2015), Hadoop Real-World Solutions Cookbook - Second Edition (March 2016), Hadoop: Data Processing and Modelling (August 2016), and Hadoop Blueprints (September 2016), all published by Packt Publishing.