Learning Data Analytics with R and Hadoop


Understanding the data analytics project life cycle

Data analytics projects involve a fixed set of tasks that should be followed to get the expected output. Here we build a data analytics project life cycle: a set of standard, data-driven processes that lead from data to insights effectively. The processes defined in the life cycle should be followed in sequence to achieve the goal using the input datasets. This process may include identifying the data analytics problem, designing and collecting datasets, performing data analytics, and visualizing data.

The data analytics project life cycle stages are seen in the following diagram:

Let's get some perspective on these stages for performing data analytics.

Identifying the problem

Today, businesses increasingly perform data analytics over web datasets in order to grow. Since the size of their data increases day by day, their analytical applications need to be scalable to collect insights from their datasets.

With the help of web analytics, we can solve business analytics problems. Let's assume that we have a large e-commerce website, and we want to know how to increase the business. We can identify the important pages of our website by categorizing them by popularity as high, medium, or low. Based on these popular pages, their types, their traffic sources, and their content, we will be able to decide the roadmap to improve business by improving web traffic as well as content.

Designing data requirement

To perform data analytics for a specific problem, we need datasets from related domains. The data source can be decided based on the domain and problem specification, and the data attributes of these datasets can be defined based on the problem definition.

For example, if we are going to perform social media analytics (problem specification), we use the data source as Facebook or Twitter. For identifying the user characteristics, we need user profile information, likes, and posts as data attributes.

Preprocessing data

In data analytics, we do not use the same data sources, data attributes, data tools, and algorithms all the time, and they do not all use data in the same format. We therefore perform data operations such as data cleansing, data aggregation, data augmentation, data sorting, and data formatting to provide the data in a format supported by all the data tools and algorithms that will be used in the data analytics.

In simple terms, preprocessing performs data operations to translate data into a fixed data format before providing it to algorithms or tools. The data analytics process will then be initiated with this formatted data as the input.

In the case of Big Data, the datasets need to be formatted and uploaded to the Hadoop Distributed File System (HDFS), where they are used by the various nodes running Mappers and Reducers in Hadoop clusters.
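As a small illustration of these operations, the following base R sketch cleanses, aggregates, and sorts a toy dataset. The column names (page, visits) and all values are made up for the example and do not come from any real data source.

```r
# Toy input with typical problems: a missing page and a non-numeric count.
raw <- data.frame(
  page   = c("/home", "/about", "/home", NA, "/about"),
  visits = c("10", "3", "7", "2", "n/a"),
  stringsAsFactors = FALSE
)

# Cleansing: coerce visits to numeric ("n/a" becomes NA) and drop incomplete rows.
raw$visits <- suppressWarnings(as.numeric(raw$visits))
clean <- raw[complete.cases(raw), ]

# Aggregation: total visits per page.
totals <- aggregate(visits ~ page, data = clean, FUN = sum)

# Sorting: most visited pages first.
totals <- totals[order(-totals$visits), ]
print(totals)
```

The result is a table in one fixed format (one row per page, one numeric visits column), which is the kind of input a later analytics step can rely on.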

Performing analytics over data

After data is available in the required format for data analytics algorithms, data analytics operations will be performed. These operations use data mining concepts to discover meaningful information from data so that better business decisions can be taken. They may use either descriptive or predictive analytics for business intelligence.

Analytics can be performed with various machine learning as well as custom algorithmic concepts, such as regression, classification, clustering, and model-based recommendation. For Big Data, the same algorithms can be run on Hadoop clusters by translating their data analytics logic into MapReduce jobs. The resulting models need to be further evaluated and improved through the evaluation stages of machine learning. Improved or optimized algorithms can provide better insights.

Visualizing data

Data visualization is used for displaying the output of data analytics. Visualization is an interactive way to represent the data insights. This can be done with various data visualization software as well as R packages, and R has a variety of packages for the visualization of datasets.

Some popular examples of visualization with R are as follows:

  • Plots for facet scales (ggplot): The following figure shows the comparison of males and females with different measures; namely, education, income, life expectancy, and literacy, using ggplot:

  • Dashboard charts: This is an rCharts type. Using this we can build interactive animated dashboards with R.

Understanding data analytics problems

In this section, we have included three practical data analytics problems with various stages of data-driven activity using R and Hadoop technologies. These problem definitions are designed so that readers can understand how Big Data analytics can be done with the analytical power of R functions and packages and the computational power of Hadoop.

The data analytics problem definitions are as follows:

  • Exploring the categorization of web pages
  • Computing the frequency of changes in the stock market
  • Predicting the sale price of a blue book for bulldozers (case study)

Exploring web pages categorization

This data analytics problem is designed to identify the category of a web page of a website, which may be categorized by popularity as high, medium, or low (regular), based on the visit count of the pages. In the data requirement design stage of the data analytics life cycle, we will see how to collect this type of data from Google Analytics.

Identifying the problem

As this is a web analytics problem, the goal is to identify the importance of the web pages of a website. Based on this information, the content, design, or visits of the less popular pages can be improved or increased.

Designing data requirement

In this section, we will be working with data requirement as well as data collection for this data analytics problem. First let's see how the requirement for data can be achieved for this problem.

Since this is a web analytics problem, we will use Google Analytics as the data source. To retrieve this data from Google Analytics, we need an existing Google Analytics account with web traffic data stored in it. To increase popularity, we first require the visit information for all of the web pages. Many other attributes are also available in Google Analytics in the form of dimensions and metrics.

Understanding the required Google Analytics data attributes

The header format of the dataset to be extracted from Google Analytics is as follows:

date, source, pageTitle, pagePath

  • date: This is the date of the day when the web page was visited
  • source: This is the referral to the web page
  • pageTitle: This is the title of the web page
  • pagePath: This is the URL of the web page

Collecting data

As we are going to extract the data from Google Analytics, we need to use RGoogleAnalytics, which is an R library for extracting Google Analytics datasets within R. To extract data, this package must be installed in R. Then you will be able to use its functions.

The following is the code for the extraction process from Google Analytics:

    # Loading the RGoogleAnalytics library
    require("RGoogleAnalytics")

    # Step 1. Authorize your account and paste the access_token
    query <- QueryBuilder()
    access_token <- query$authorize()

    # Step 2. Create a new Google Analytics API object
    ga <- RGoogleAnalytics()

    # To retrieve profiles from Google Analytics
    ga.profiles <- ga$GetProfileData(access_token)

    # List the GA profiles
    ga.profiles

    # Step 3. Setting up the input parameters
    profile <- ga.profiles$id[3]
    startdate <- "2010-01-08"
    enddate <- "2013-08-23"
    dimension <- "ga:date,ga:source,ga:pageTitle,ga:pagePath"
    metric <- "ga:visits"
    sort <- "ga:visits"
    maxresults <- 100099

    # Step 4. Build the query string, use the profile by setting its index value
    query$Init(start.date = startdate,
               end.date = enddate,
               dimensions = dimension,
               metrics = metric,
               max.results = maxresults,
               table.id = paste("ga:", profile, sep = "", collapse = ","),
               access_token = access_token)

    # Step 5. Make a request to get the data from the API
    ga.data <- ga$GetReportData(query)

    # Look at the returned data
    head(ga.data)

    # Save the extracted data as a CSV file
    write.csv(ga.data, "webpages.csv", row.names = FALSE)

Preprocessing data

Now we have the raw data from Google Analytics available in a CSV file. We need to process this data before providing it to the MapReduce algorithm.

There are two main changes that need to be made to the dataset:

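  • Query parameters need to be removed from the column pagePath as follows:

    # Read the raw export written during data collection
    data <- read.csv("webpages.csv", stringsAsFactors = FALSE)
    # Split each path at "?" and keep only the part before it
    pagePath <- as.character(data$pagePath)
    pagePath <- strsplit(pagePath, "\\?")
    pagePath <- do.call("rbind", pagePath)
    pagePath <- pagePath[, 1]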

  • The new CSV file needs to be created as follows:

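    # Keep only the columns used by the MapReduce jobs: source, pagePath, visits
    data <- data.frame(source = data$source, pagePath = pagePath, visits = data$visits)
    write.csv(data, "webpages_mapreduce.csv", row.names = FALSE)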
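To see the effect of removing query parameters without a Google Analytics export at hand, here is a minimal base R sketch on toy rows. The sample values are made up, and vapply is used here as an alternative way to pick the part of each path before the "?".

```r
# Toy rows standing in for the Google Analytics export (values are made up).
data <- data.frame(
  source   = c("google", "direct", "google"),
  pagePath = c("/products?id=10", "/about", "/products?id=22&ref=home"),
  visits   = c(12, 5, 8),
  stringsAsFactors = FALSE
)

# Split each path at "?" and keep only the first piece.
clean_path <- vapply(strsplit(data$pagePath, "\\?"),
                     function(parts) parts[1], character(1))

# Rebuild the three-column layout expected by the later MapReduce steps.
cleaned <- data.frame(source = data$source, pagePath = clean_path,
                      visits = data$visits, stringsAsFactors = FALSE)
print(cleaned)
```

After this step, both /products rows share the same pagePath, so their visits can later be summed under a single key.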

Performing analytics over data

To perform the categorization over website pages, we will build and run the MapReduce algorithm with R and Hadoop integration.

In case of chaining MapReduce jobs, multiple Mappers and Reducers can communicate in such a way that the output of the first job will be assigned to the second job as input. The MapReduce execution sequence is described in the following diagram:

Chaining MapReduce

Now let's start with the programming task to perform analytics:

  1. Initialize by setting Hadoop variables and loading the rmr2 and rhdfs packages of the RHadoop libraries:

    # Setting up the Hadoop variables needed by RHadoop
    Sys.setenv(HADOOP_HOME = "/usr/local/hadoop/")
    Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop")

    # Loading the RHadoop libraries rmr2 and rhdfs
    library(rmr2)
    library(rhdfs)

    # Initializing HDFS
    hdfs.init()

  2. Upload the datasets to HDFS:

    # First load the data into the R session
    webpages <- read.csv("/home/vigs/Downloads/webpages_mapreduce.csv")

    # Save the R object to HDFS
    webpages.hdfs <- to.dfs(webpages)

Now we will develop Hadoop MapReduce job 1 for this analysis. We will divide this job into a Mapper and a Reducer. Since there are two MapReduce jobs, there will be two Mappers and two Reducers. Also note that we need only one file for both jobs, containing all the Mappers and Reducers. Each Mapper and Reducer is defined as a separate function.

Let's see MapReduce job 1.

  • Mapper 1: The code for this is as follows:

    mapper1 <- function(k, v) {
      # Store the pagePath column data in the key object
      key <- v[2]
      # Store the visits column data in the val object
      val <- v[3]
      # Emit key and value for each row
      keyval(key, val)
    }

    # Total number of visits across all pages, used by the Reducer
    totalvisits <- sum(webpages$visits)

  • Reducer 1: The code for this is as follows:

    reducer1 <- function(k, v) {
      # Calculate the percentage of visits for the specific URL
      per <- (sum(v) / totalvisits) * 100
      # Identify the category of the URL; the chained tests also
      # cover the boundary values 33 and 67
      if (per < 33) {
        val <- "low"
      } else if (per < 67) {
        val <- "medium"
      } else {
        val <- "high"
      }
      # Emit key and value
      keyval(k, val)
    }

  • Output of MapReduce job 1: The intermediate output for the information is shown in the following screenshot:

The output in the preceding screenshot is shown only to illustrate what MapReduce job 1 produces. It can be considered an intermediate output, for which only 100 data rows from the whole dataset were used. These rows contain 23 unique URLs, so the output lists 23 URLs.
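The arithmetic behind job 1 can be checked without a Hadoop cluster. The following base R sketch reproduces the same computation (sum visits per URL, convert to a percentage of total visits, assign a category) on made-up rows. The categorize helper is a hypothetical name introduced for this example, and aggregate stands in for the grouping that MapReduce performs between the Mapper and the Reducer.

```r
# Toy stand-in for webpages_mapreduce.csv (values are made up).
webpages <- data.frame(
  source   = c("google", "direct", "google", "bing", "direct"),
  pagePath = c("/home", "/home", "/products", "/about", "/home"),
  visits   = c(40, 30, 20, 5, 5),
  stringsAsFactors = FALSE
)
totalvisits <- sum(webpages$visits)

# Hypothetical helper: map a percentage share to a popularity category.
categorize <- function(per) {
  if (per < 33) "low" else if (per < 67) "medium" else "high"
}

# Group by URL and sum the visits, as the Reducer sees them per key.
agg <- aggregate(visits ~ pagePath, data = webpages, FUN = sum)
agg$share    <- agg$visits / totalvisits * 100
agg$category <- vapply(agg$share, categorize, character(1))
print(agg)
```

With these numbers, /home receives 75 percent of all visits and is categorized as high, while /about (5 percent) and /products (20 percent) fall into low.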

Let's see Hadoop MapReduce job 2:

  • Mapper 2: The code for this is as follows:

    # Mapper:
    mapper2 <- function(k, v) {
      # Reversing key and values and emitting them
      keyval(v, k)
    }

  • Reducer 2: The code for this is as follows:

    key <- NA
    val <- NULL

    # Reducer:
    reducer2 <- function(k, v) {
      # For checking whether key-values are already assigned or not
      if (is.na(key)) {
        key <- k
        val <- v
      } else {
        if (key == k) {
          val <- c(val, v)
        } else {
          key <- k
          val <- v
        }
      }
      # Emitting key and list of values
      keyval(key, list(val))
    }

Before executing the MapReduce job, please start all the Hadoop daemons and check the HDFS connection via the hdfs.init() method. If your Hadoop daemons have not been started, you can start them with hduser@ubuntu:~$ $HADOOP_HOME/bin/start-all.sh.

Once the logic of the Mappers and Reducers is ready, the MapReduce jobs can be executed with the mapreduce function of the rmr2 package. Because we have developed multiple MapReduce jobs, we need to call the mapreduce function within the mapreduce function with the required parameters.
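The nesting idea can be tried out in plain R before running anything on a cluster. The sketch below is an in-memory emulation under stated assumptions, not the rmr2 implementation: run_job is a hypothetical helper that maps, groups values by key, and reduces; the sample rows are made up; and the second Reducer pastes the URLs into one string for display rather than emitting a list.

```r
# Hypothetical in-memory stand-in for a single MapReduce job.
# map() returns a data frame with key and val columns; values are then
# grouped by key and reduce() is called once per key.
run_job <- function(input, map, reduce) {
  mapped  <- map(input)
  grouped <- split(mapped$val, mapped$key)
  results <- lapply(names(grouped), function(k) reduce(k, grouped[[k]]))
  do.call(rbind, results)
}

# Toy input (values are made up).
webpages <- data.frame(pagePath = c("/home", "/home", "/products", "/about"),
                       visits   = c(50, 25, 20, 5),
                       stringsAsFactors = FALSE)
totalvisits <- sum(webpages$visits)

# Job 1: URL -> popularity category.
map1 <- function(df) {
  data.frame(key = df$pagePath, val = df$visits, stringsAsFactors = FALSE)
}
reduce1 <- function(k, v) {
  per <- sum(v) / totalvisits * 100
  category <- if (per < 33) "low" else if (per < 67) "medium" else "high"
  data.frame(key = k, val = category, stringsAsFactors = FALSE)
}

# Job 2: swap key and value, then collect the URLs of each category.
map2 <- function(df) {
  data.frame(key = df$val, val = df$key, stringsAsFactors = FALSE)
}
reduce2 <- function(k, v) {
  data.frame(key = k, val = paste(v, collapse = ", "), stringsAsFactors = FALSE)
}

# Chaining: the output of job 1 is passed directly as the input of job 2.
# With rmr2, the same nesting would be applied to mapreduce() calls (not run here).
result <- run_job(run_job(webpages, map1, reduce1), map2, reduce2)
print(result)
```

The inner call finishes first and its result becomes the input of the outer call, which is the same order of execution that the chained MapReduce job follows.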

The command for calling a chained MapReduce job is seen in the following figure:

The following is the command for retrieving the generated output from HDFS:


While executing Hadoop MapReduce, the execution log output will be printed over the terminal for the purpose of monitoring. We will understand MapReduce job 1 and MapReduce job 2 by separating them into different parts.

The details for MapReduce job 1 are as follows:

  • Tracking the MapReduce job metadata: With this initial portion of log, we can identify the metadata for the Hadoop MapReduce job. We can also track the job status with the web browser by calling the given Tracking URL.

  • Tracking status of the Mapper and Reducer tasks: With this portion of log, we can monitor the status of the Mapper or Reducer tasks being run on the Hadoop cluster to get details such as whether they succeeded or failed.

  • Tracking HDFS output location: Once the MapReduce job is completed, its output location will be displayed at the end of logs.

The details for MapReduce job 2 are as follows:

  • Tracking the MapReduce job metadata: With this initial portion of log, we can identify the metadata for the Hadoop MapReduce job. We can also track the job status with the web browser by calling the given Tracking URL.

  • Tracking status of the Mapper and Reducer tasks: With this portion of log, we can monitor the status of the Mapper or Reducer tasks being run on the Hadoop cluster to get details such as whether they succeeded or failed.

  • Tracking HDFS output location: Once the MapReduce job is completed, its output location will be displayed at the end of the logs.

The output of this chained MapReduce job is stored at an HDFS location, which can be retrieved by the command:


The response to the preceding command is shown in the following figure (output only for the top 1000 rows of the dataset):

Visualizing data

We collected the web page categorization output using the three categories. The simplest way to present it is to list the URLs. But if we have more information, such as sources, we can represent the web pages as nodes of a graph, colored by popularity, with directed edges where users follow the links. This can lead to more informative insights.


In this article, we learned how to perform Big Data analytics with various data driven activities over an R and Hadoop integrated environment.

You've been reading an excerpt of:

Big Data Analytics with R and Hadoop
