Chapter 8. Machine Learning and Predictive Analytics Using Mahout and R

In this chapter, we'll cover the following recipes:

  • Setting up the Mahout development environment

  • Creating an item-based recommendation engine using Mahout

  • Creating a user-based recommendation engine using Mahout

  • Using predictive analytics for the marketing data of a bank

  • Clustering text data using K-Means

  • Performing population data analytics using R

  • Performing Twitter sentiment analytics using R

  • Performing predictive analytics using R

Introduction


In the previous chapter, we talked about how to automate Hadoop and its ecosystem tasks using Oozie. In this chapter, we will go deeper into machine learning using Mahout and R. Mahout is a machine learning library that lets us solve machine learning problems with relative ease, while R is a statistical tool that helps us build and analyze models. So, let's get started.

Setting up the Mahout development environment


In this recipe, we are going to take a look at how to set up the Mahout development environment.

Getting ready

To perform this recipe, you should have a running Hadoop cluster.

How to do it...

Setting up the Mahout environment is very easy:

  1. To start with, we need to download the latest version of Mahout from http://www.apache.org/dyn/closer.cgi/mahout/.

  2. I am going to use version 0.11.1, which can be found at http://www.eu.apache.org/dist/mahout/0.11.1/apache-mahout-distribution-0.11.1.tar.gz.

  3. Next, extract the tarball and rename the folder to mahout for simplicity's sake (these commands assume you are working in /usr/local, which matches the MAHOUT_HOME path set in the next step):

    sudo tar -xzf apache-mahout-distribution-0.11.1.tar.gz
    sudo mv apache-mahout-distribution-0.11.1 mahout
    
  4. To use the Mahout commands from anywhere, we add the distribution's bin folder to PATH.

    Edit ~/.bashrc and add the following commands to it:

    export MAHOUT_HOME=/usr/local/mahout
    export PATH=$PATH:$MAHOUT_HOME/bin
    
  5. Execute the following command to check whether the changes have taken effect...
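
    One quick check, for instance, is to reload the profile and run the mahout launcher with no arguments; it should print the list of available Mahout programs, confirming that the PATH change is in effect:

    source ~/.bashrc
    mahout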

Creating an item-based recommendation engine using Mahout


In this recipe, we are going to take a look at how to use Mahout to generate item-based recommendations. Recommendation engines are among the most common machine learning use cases: a recommendation engine generates suggestions based on the input data provided to it. Here, we will generate recommendations based on user preferences for certain items.

Getting ready

To perform this recipe, you should have a running Hadoop cluster as well as the latest version of Mahout installed on it.

How to do it...

Mahout provides built-in support for item-based recommendations. In order to execute a program using Mahout, we first need to prepare the input data and store it in a certain folder. The input data needs to be in a specific format (userId, itemId, preference), where userId is the unique user identifier, itemId is the unique item identifier, and preference is a rating given by a user to a specific item...
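
As a sketch of the end-to-end flow (the file name, HDFS paths, and similarity measure here are illustrative assumptions, not the book's exact values), the input is a CSV of such triples, which we copy to HDFS and feed to Mahout's recommenditembased driver:

# ratings.csv -- one userId,itemId,preference triple per line, for example:
#   1,101,5.0
#   1,102,3.0
#   2,101,2.0
#   2,103,4.5
hadoop fs -mkdir /recommend
hadoop fs -put ratings.csv /recommend/input

# Run the distributed item-based recommender job
mahout recommenditembased \
    --input /recommend/input \
    --output /recommend/output \
    --similarityClassname SIMILARITY_COOCCURRENCE \
    --numRecommendations 5

The job writes, for each user, the top five items they are most likely to prefer, along with estimated preference scores.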

Creating a user-based recommendation engine using Mahout


In this recipe, we are going to take a look at how to use Mahout to generate user-based recommendations. The user-based recommendation engine is not directly available as a MapReduce job; we have to run it in a sequential manner, as described in the next section.

Getting ready

To perform this recipe, you should have a running Hadoop cluster as well as the latest version of Mahout installed on it. We will also need an IDE, such as Eclipse, for code development.

How to do it...

User-based recommendations work on the simple principle that similar users tend to like the same set of items. To implement one, we first need to create a Maven project and add the following dependency to it (the version should match the Mahout release installed earlier):

<dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-mr</artifactId>
    <version>0.11.1</version>
</dependency>

Next, we create a...
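
A sequential user-based recommender built on Mahout's Taste API typically looks like the following sketch (the data file ratings.csv, the neighborhood size of 10, and the choice of Pearson correlation are illustrative assumptions, not necessarily the book's exact code):

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserBasedRecommenderExample {
    public static void main(String[] args) throws Exception {
        // Load userId,itemId,preference triples from a local CSV file
        DataModel model = new FileDataModel(new File("ratings.csv"));
        // Compute user-user similarity with Pearson correlation
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Treat the 10 most similar users as the neighborhood
        UserNeighborhood neighborhood =
                new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Top 3 recommendations for user 1
        List<RecommendedItem> recommendations = recommender.recommend(1, 3);
        for (RecommendedItem item : recommendations) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}

Because everything runs in a single JVM against a local file, this is the sequential mode mentioned earlier; no MapReduce job is submitted.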

Using predictive analytics for the marketing data of a bank


In this recipe, we are going to take a look at how to use Mahout to generate a predictive model and validate how good this model is against some sample data. Here, we will be using the sample data collected by a bank during their marketing operations.

Getting ready

To perform this recipe, you should have a running Hadoop cluster as well as the latest version of Mahout installed on it.

How to do it...

In this recipe, we are going to use Logistic Regression in order to predict the occurrence of an event. It uses predictors from the given data in order to calculate the probability. The Mahout implementation uses the Stochastic Gradient Descent (SGD) algorithm for logistic regression. You can learn more about SGD for logistic regression at http://blog.trifork.com/2014/02/04/an-introduction-to-mahouts-logistic-regression-sgd-classifier/.

SGD is, by default, a sequential algorithm, so we cannot run any parallel activities on it. Even though it is...
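
To give a flavor of the command-line workflow (the file name and predictor columns below are assumptions based on the publicly available bank marketing dataset, not necessarily the book's exact invocation), Mahout's trainlogistic driver trains an SGD model and runlogistic evaluates it:

# Train: predict the binary target y from three numeric predictors
mahout trainlogistic \
    --input bank.csv \
    --output bank.model \
    --target y \
    --categories 2 \
    --predictors age balance duration \
    --types numeric \
    --features 20 \
    --passes 100 \
    --rate 50

# Evaluate: report the area under the ROC curve and a confusion matrix
mahout runlogistic \
    --input bank.csv \
    --model bank.model \
    --auc \
    --confusion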

Clustering text data using K-Means


In this recipe, we are going to take a look at how to cluster text data using Mahout's implementation of the K-Means algorithm. K-Means is a very popular clustering algorithm; you can read more about it at https://en.wikipedia.org/wiki/K-means_clustering.

Getting ready

To perform this recipe, you should have a running Hadoop cluster as well as the latest version of Mahout installed on it.

How to do it...

In this recipe, we are going to use Mahout's K-Means implementation to cluster the available text data. To do this, we first need to get some text data and copy it to HDFS:

hadoop fs -mkdir /kmeans
hadoop fs -put mydata.txt /kmeans/input

In order to execute the K-Means job on the given data, we first need to convert it into sequence files, and then convert these sequence files into TF-IDF vectors. Mahout provides built-in utilities to perform these actions. The following are the commands to do this.

To convert the text data into sequence files, here is...
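
Put together, the pipeline usually looks like the following (the HDFS paths, the number of clusters, and the distance measure are illustrative choices):

# Convert the raw text under /kmeans/input into Hadoop SequenceFiles
mahout seqdirectory -i /kmeans/input -o /kmeans/seq

# Turn the SequenceFiles into sparse TF-IDF vectors
mahout seq2sparse -i /kmeans/seq -o /kmeans/vectors

# Cluster the TF-IDF vectors into 5 clusters using cosine distance,
# running at most 10 iterations and labeling the points afterwards (-cl)
mahout kmeans \
    -i /kmeans/vectors/tfidf-vectors \
    -c /kmeans/centroids \
    -o /kmeans/clusters \
    -k 5 \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -x 10 -ow -cl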

Performing population data analytics using R


So far, we have talked about how to use Mahout to solve various machine learning problems. Now, we are going to introduce another tool, the R language, which has built-in support for various mathematical and statistical operations.

Getting ready

To perform this recipe, you should have R installed on your machine. You can download the installer from https://cran.r-project.org/bin/windows/base/ (this link points to the Windows build; installers for other platforms are available on CRAN as well).

How to do it...

In this recipe, we are going to learn some basic operations that one can perform using R. To start with, we will use a dataset that contains information about Australia's population in its various states and territories. This is what the dataset looks like:

Year NSW Vic. Qld SA WA Tas. NT ACT Aust.
1917 1904 1409 683 440 306 193 5 3 4941
1927 2402 1727 873 565 392 211 4 8 6182
1937 2693 1853 993 589 457 233 6 11 6836
1947 2985 2055 1106 646 502 257 11 17 7579
1957 3625 2656 1413 873 688 326 21 38 9640
1967 4295 3274 1700 1110 879 375 62 103 11799
1977 5002 3837 2130 1286...
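
With such a file on disk, a few basic operations look like the following sketch (the file name is hypothetical, and the figures appear to be in thousands):

># Load the whitespace-separated table; header = TRUE reads the first row as column names
>population <- read.table("aus_population.txt", header = TRUE)
># Summary statistics (min, max, mean, quartiles) for every column
>summary(population)
># National population recorded in the most recent year of the table
>tail(population$Aust., 1)
># Plot national population growth over time
>plot(population$Year, population$Aust., type = "b",
+     xlab = "Year", ylab = "Population (thousands)")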

Performing Twitter sentiment analytics using R


In an earlier chapter, we saw how to perform Twitter sentiment analytics using Hive and Hadoop. In this recipe, we are going to take a look at how to do this using R.

Getting ready

To perform this recipe, you should have R installed on your machine. You should also have a Twitter account and a Twitter application with an API key, API secret, access token, and access secret so that you can receive tweets in real time.

How to do it...

To get started, we first need to install certain R packages that will be required in this recipe. The following are the commands:

>install.packages("twitteR")
>install.packages("plyr")
>install.packages("stringr")
>install.packages(c("devtools", "rjson", "bit64", "httr"))

Once the installation is complete, load the following packages:

>library(devtools)
>library(twitteR)

Next, we need to supply the keys that Twitter generated on its application page, as follows:

>api_key <...
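
A minimal sketch of this step, plus a first pull of tweets with the twitteR package, looks as follows (the credential strings are placeholders, and the search term is an arbitrary example):

>api_key <- "YOUR_API_KEY"
>api_secret <- "YOUR_API_SECRET"
>access_token <- "YOUR_ACCESS_TOKEN"
>access_secret <- "YOUR_ACCESS_SECRET"
># Authenticate this R session against the Twitter API
>setup_twitter_oauth(api_key, api_secret, access_token, access_secret)
># Fetch 100 recent tweets mentioning a term and extract their text
>tweets <- searchTwitter("hadoop", n = 100)
>tweet_text <- sapply(tweets, function(t) t$getText())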

Performing predictive analytics using R


In the previous recipe, we talked about how to perform sentiment analytics using R. In this recipe, we are going to take a look at how to perform predictive analytics using R. Here, we will be using the Iris flower dataset in order to predict a flower's species based on its features. You can learn more about this at https://en.wikipedia.org/wiki/Iris_flower_data_set.

Getting ready

To perform this recipe, you should have R installed on your machine.

How to do it...

To get started, we need to install an R package called e1071:

>install.packages("e1071")

This package provides the Naive Bayes implementation we need; the iris dataset itself ships with base R. So, we load the library and then load the data:

>library(e1071)
>data(iris)

You can check whether the data has loaded properly by executing the following command:

>iris

In this example, we are going to use the Naive Bayes algorithm to classify the data into species. So, now we have to train the model using Naive Bayes, as shown here...
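
A typical train-and-evaluate flow with e1071's naiveBayes looks like the following sketch (the 80/20 split and the fixed random seed are illustrative choices):

># Hold out a random 20% of the rows for testing
>set.seed(42)
>train_idx <- sample(seq_len(nrow(iris)), size = 0.8 * nrow(iris))
>train <- iris[train_idx, ]
>test <- iris[-train_idx, ]
># Train Naive Bayes: Species as a function of all four measurements
>model <- naiveBayes(Species ~ ., data = train)
># Predict the held-out rows and cross-tabulate against the true species
>predictions <- predict(model, test)
>table(predictions, test$Species)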


About the author

Tanmay Deshpande

Tanmay Deshpande is a Hadoop and big data evangelist. He currently works with Schlumberger as a Big Data Architect in Pune, India. He is interested in a wide range of technologies, such as Hadoop, Hive, Pig, NoSQL databases, Mahout, Sqoop, Java, and cloud computing. He has vast experience in application development in various domains, such as oil and gas, finance, telecom, manufacturing, security, and retail. He enjoys solving machine-learning problems and spends his time reading anything that he can get his hands on. He has a great interest in open source technologies and has been promoting them through his talks. Before Schlumberger, he worked with Symantec, Lumiata, and Infosys. Through his innovative thinking and dynamic leadership, he has successfully completed various projects. He regularly blogs on his website http://hadooptutorials.co.in. You can connect with him on LinkedIn at https://www.linkedin.com/in/deshpandetanmay/. He has also authored Mastering DynamoDB (August 2014), DynamoDB Cookbook (September 2015), Hadoop Real-World Solutions Cookbook - Second Edition (March 2016), Hadoop: Data Processing and Modelling (August 2016), and Hadoop Blueprints (September 2016), all published by Packt Publishing.