Reader small image

You're reading from  Learning Bayesian Models with R

Product typeBook
Published inOct 2015
Reading LevelBeginner
PublisherPackt
ISBN-139781783987603
Edition1st Edition
Languages
Right arrow
Author (1)
Hari Manassery Koduvely
Hari Manassery Koduvely
author image
Hari Manassery Koduvely

Dr. Hari M. Koduvely is an experienced data scientist working at the Samsung R&D Institute in Bangalore, India. He has a PhD in statistical physics from the Tata Institute of Fundamental Research, Mumbai, India, and post-doctoral experience from the Weizmann Institute, Israel, and Georgia Tech, USA. Prior to joining Samsung, the author has worked for Amazon and Infosys Technologies, developing machine learning-based applications for their products and platforms. He also has several publications on Bayesian inference and its applications in areas such as recommendation systems and predictive health monitoring. His current interest is in developing large-scale machine learning methods, particularly for natural language understanding.
Read more about Hari Manassery Koduvely

Right arrow

Chapter 9. Bayesian Modeling at Big Data Scale

When we learned the principles of Bayesian inference in Chapter 3, Introducing Bayesian Inference, we saw that as the amount of training data increases, contribution to the parameter estimation from data overweighs that from the prior distribution. Also, the uncertainty in parameter estimation decreases. Therefore, you may wonder why one needs Bayesian modeling in large-scale data analysis. To answer this question, let us look at one such problem, which is building recommendation systems for e-commerce products.

In a typical e-commerce store, there will be millions of users and tens of thousands of products. However, each user would have purchased only a small fraction (less than 10%) of all the products found in the store in their lifetime. Let us say the e-commerce store is collecting users' feedback for each product sold as a rating on a scale of 1 to 5. Then, the store can create a user-product rating matrix to capture the ratings of all...

Distributed computing using Hadoop


In the last decade, tremendous progress was made in distributed computing when two research engineers from Google developed a computing paradigm called the MapReduce framework and an associated distributed filesystem called Google File System (reference 2 in the References section of this chapter). Later on, Yahoo developed an open source version of this distributed filesystem named Hadoop that became the hallmark of Big Data computing. Hadoop is ideal for processing large amounts of data, which cannot fit into the memory of a single large computer, by distributing the data into multiple computers and doing the computation on each node locally from the disk. An example would be extracting relevant information from log files, where typically the size of data for a month would be in the order of terabytes.

To use Hadoop, one has to write programs using MapReduce framework to parallelize the computing. A Map operation splits the data into multiple key-value...

RHadoop for using Hadoop from R


RHadoop is a collection of open source packages using which an R user can manage and analyze data stored in the Hadoop Distributed File System (HDFS). In the background, RHadoop will translate these as MapReduce operations in Java and run them on HDFS.

The various packages in RHadoop and their uses are as follows:

  • rhdfs: Using this package, a user can connect to an HDFS from R and perform basic actions such as read, write, and modify files.

  • rhbase: This is the package to connect to a HBASE database from R and to read, write, and modify tables.

  • plyrmr: Using this package, an R user can do the common data manipulation tasks such as the slicing and dicing of datasets. This is similar to the function of packages such as plyr or reshape2.

  • rmr2: Using this package, a user can write MapReduce functions in R and execute them in an HDFS.

Unlike the other packages discussed in this book, the packages associated with RHadoop are not available from CRAN. They can be downloaded...

Spark – in-memory distributed computing


One of the issues with Hadoop is that after a MapReduce operation, the resulting files are written to the hard disk. Therefore, when there is a large data processing operation, there would be many read and write operations on the hard disk, which makes processing in Hadoop very slow. Moreover, the network latency, which is the time required to shuffle data between different nodes, also contributes to this problem. Another disadvantage is that one cannot make real-time queries from the files stored in HDFS. For machine learning problems, during training phase, the MapReduce will not persist over iterations. All this makes Hadoop not an ideal platform for machine learning.

A solution to this problem was invented at Berkeley University's AMP Lab in 2009. This came out of the PhD work of Matei Zaharia, a Romanian born computer scientist. His paper Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (reference 4 in...

SparkR


Similar to RHadoop, SparkR is an R package that allows R users to use Spark APIs through the RDD class. For example, using SparkR, users can run jobs on Spark from RStudio. SparkR can be evoked from RStudio. To enable this, include the following lines in your .Rprofile file that R uses at startup to initialize the environments:

Sys.setenv(SPARK_HOME/.../spark-1.5.0-bin-hadoop2.6")
#provide the correct path where spark downloaded folder is kept for SPARK_HOME 
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),""R",""lib"),".libPaths()))

Once this is done, start RStudio and enter the following commands to start using SparkR:

>library(SparkR)
>sc <- sparkR.init(master="local")

As mentioned, as of the latest version 1.5 when this chapter is in writing, SparkR supports limited functionalities of R. This mainly includes data slicing and dicing and summary stat functions. The current version does not support the use of contributed R packages; however, it is planned for a future release...

Linear regression using SparkR


In the following example, we will illustrate how to use SparkR for machine learning. For this, we will use the same dataset of energy efficiency measurements that we used for linear regression in Chapter 5, Bayesian Regression Models:

>library(SparkR)
>sc <- sparkR.init(master="local")
>sqlContext <- sparkRSQL.init(sc)

#Importing data
>df <- read.csv("/Users/harikoduvely/Projects/Book/Data/ENB2012_data.csv",header = T)
>#Excluding variable Y2,X6,X8 and removing records from 768 containing mainly null values
>df <- df[1:768,c(1,2,3,4,5,7,9)]
>#Converting to a Spark R Dataframe
>dfsr <- createDataFrame(sqlContext,df) 
>model <- glm(Y1 ~ X1 + X2 + X3 + X4 + X5 + X7,data = dfsr,family = "gaussian")
 > summary(model)

Computing clusters on the cloud


In order to process large datasets using Hadoop and associated R packages, one needs a cluster of computers. In today's world, it is easy to get using cloud computing services provided by Amazon, Microsoft, and others. One needs to pay only for the amount of CPU and storage used. No need for upfront investments on infrastructure. The top four cloud computing services are AWS by Amazon, Azure by Microsoft, Compute Cloud by Google, and Bluemix by IBM. In this section, we will discuss running R programs on AWS. In particular, you will learn how to create an AWS instance; install R, RStudio, and other packages in that instance; develop and run machine learning models.

Amazon Web Services

Popularly known as AWS, Amazon Web Services started as an internal project in Amazon in 2002 to meet the dynamic computing requirements to support their e-commerce business. This grew as an infrastructure as a service and in 2006 Amazon launched two services to the world, Simple...

Other R packages for large scale machine learning


Apart from RHadoop and SparkR, there are several other native R packages specifically built for large-scale machine learning. Here, we give a brief overview of them. Interested readers should refer to CRAN Task View: High-Performance and Parallel Computing with R (reference 10 in the References section of the chapter).

Though R is single-threaded, there exists several packages for parallel computation in R. Some of the well-known packages are Rmpi (R version of the popular message passing interface), multicore, snow (for building R clusters), and foreach. From R 2.14.0, a new package called parallel started shipping with the base R. We will discuss some of its features here.

The parallel R package

The parallel package is built on top of the multicore and snow packages. It is useful for running a single program on multiple datasets such as K-fold cross validation. It can be used for parallelizing in a single machine over multiple CPUs/cores...

Exercises


  1. Revisit the classification problem in Chapter 6, Bayesian Classification Models. Repeat the same problem using the glm() function of SparkR.

  2. Revisit the linear regression problem, we did in this chapter, using SparkR. After creating the AWS instance, repeat this problem using RStudio server on AWS.

References


  1. "MapReduce Implementation of Variational Bayesian Probabilistic Matrix Factorization Algorithm". In: IEEE Conference on Big Data. pp 145-152. 2013

  2. Dean J. and Ghemawat S. "MapReduce: Simplified Data Processing on Large Clusters". Communications of the ACM 51 (1). 107-113

  3. https://github.com/jeffreybreen/tutorial-rmr2-airline/blob/master/R/1-wordcount.R

  4. Chowdhury M., Das T., Dave A., Franklin M.J., Ma J., McCauley M., Shenker S., Stoica I., and Zaharia M. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing". NSDI 2012. 2012

  5. Amazon Elastic Compute Cloud (EC2) User Guide, Kindle e-book by Amazon Web Services, updated April 9, 2014

  6. Spark documentation for AWS at http://spark.apache.org/docs/latest/ec2-scripts.html

  7. AWS documentation for Spark at http://aws.amazon.com/elasticmapreduce/details/spark/

  8. Microsoft Virtual Academy website at http://www.microsoftvirtualacademy.com/training-courses/getting-started-with-microsoft-azure-machine-learning...

Summary


In this last chapter of the book, we covered various frameworks to implement large-scale machine learning. These are very useful for Bayesian learning too. For example, to simulate from a posterior distribution, one could run a Gibbs sampling over a cluster of machines. We learned how to connect to Hadoop from R using the RHadoop package and how to use R with Spark using SparkR. We also discussed how to set up clusters in cloud services such as AWS and how to run Spark on them. Some of the native parallelization frameworks such as parallel and foreach functions were also covered.

The overall aim of this book was to introduce readers to the area of Bayesian modeling using R. Readers should have gained a good grasp of theory and concepts behind Bayesian machine learning models. Since the examples were mainly given for the purposes of illustration, I urge readers to apply these techniques to real-world problems to appreciate the subject of Bayesian inference more deeply.

lock icon
The rest of the chapter is locked
You have been reading a chapter from
Learning Bayesian Models with R
Published in: Oct 2015Publisher: PacktISBN-13: 9781783987603
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Author (1)

author image
Hari Manassery Koduvely

Dr. Hari M. Koduvely is an experienced data scientist working at the Samsung R&D Institute in Bangalore, India. He has a PhD in statistical physics from the Tata Institute of Fundamental Research, Mumbai, India, and post-doctoral experience from the Weizmann Institute, Israel, and Georgia Tech, USA. Prior to joining Samsung, the author has worked for Amazon and Infosys Technologies, developing machine learning-based applications for their products and platforms. He also has several publications on Bayesian inference and its applications in areas such as recommendation systems and predictive health monitoring. His current interest is in developing large-scale machine learning methods, particularly for natural language understanding.
Read more about Hari Manassery Koduvely