Chapter 10. R and Big Data
We have come to the final chapter of this book, where we push to the very limits of large-scale data processing. The term Big Data describes the ever-growing volume, velocity, and variety of data being generated on the Internet, by connected devices, and in many other places. Many organizations now have massive datasets measured in petabytes (one petabyte is 1,048,576 gigabytes), more than ever before. Processing and analyzing Big Data is extremely challenging for traditional data processing tools and database architectures.
In 2005, Doug Cutting and Mike Cafarella at Yahoo! developed Hadoop, based on earlier work by Google, to address these challenges. They set out to build a new data platform to process, index, and query billions of web pages efficiently. With Hadoop, work that would previously have required very expensive supercomputers can be done on large clusters of inexpensive standard servers. As the volume of data grows, more...
Before we learn how to use Hadoop (for more information refer to http://hadoop.apache.org/) and related tools in R, we need to understand the basics of Hadoop. For our purposes, it suffices to know that Hadoop comprises two key components: the Hadoop Distributed File System (HDFS) and the MapReduce framework to execute data processing tasks. Hadoop includes many other components for task scheduling, job management, and others, but we shall not concern ourselves with those in this book.
HDFS, as the name suggests, is a virtual filesystem that is distributed across a cluster of servers. HDFS stores files in blocks, with a default block size of 128 MB. For example, a 1 GB file is split into eight blocks of 128 MB, which are distributed to different servers in the cluster. Furthermore, to prevent data loss due to server failure, the blocks are replicated. By default, they are replicated three times—there are three copies of each block of data in the cluster, and each copy...
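The arithmetic behind this example can be illustrated in a few lines of plain R (no Hadoop required). The 128 MB block size and replication factor of 3 are the HDFS defaults described above:

```r
# Sketch: how HDFS splits and replicates a file, using the default settings
block_size_mb <- 128   # default HDFS block size
replication   <- 3     # default replication factor

file_size_mb <- 1024   # a 1 GB file, as in the example above

# Number of blocks the file is split into
num_blocks <- ceiling(file_size_mb / block_size_mb)

# Total space consumed across the cluster once every block is replicated
# (the last block of a file may be smaller than 128 MB; here 1 GB divides
# evenly, so all eight blocks are full)
total_storage_mb <- file_size_mb * replication

num_blocks        # 8
total_storage_mb  # 3072
```

Note that replication trades disk space for fault tolerance: a 1 GB file occupies 3 GB of cluster storage, but any single server can fail without losing data.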
Setting up Hadoop on Amazon Web Services
There are many ways to set up a Hadoop cluster. We can install Hadoop on a single server in pseudo-distributed mode to simulate a cluster, or on an actual cluster of servers or virtual machines in fully distributed mode. Several distributions of Hadoop are also available, from the vanilla open source version provided by the Apache Software Foundation to commercial distributions such as Cloudera, Hortonworks, and MapR. Covering all the different ways of setting up Hadoop is beyond the scope of this book. We instead provide instructions for one way to set up Hadoop and other relevant tools for the purposes of the examples in this chapter. If you are using an existing Hadoop cluster, or setting one up in a different way, you might have to modify some of the steps.
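As an illustration of what such a setup involves, a small cluster can be launched from the command line with the AWS CLI's `aws emr create-cluster` command. This is a configuration sketch only: the cluster name, key pair, instance type, and release label below are placeholders that you would replace with your own values (and check against the currently available EMR releases), and the command assumes the AWS CLI is installed and configured with your credentials.

```shell
# Sketch: launch a small Hadoop cluster on Amazon Elastic MapReduce.
# All names and sizes are placeholders -- substitute your own key pair,
# instance type, and a current EMR release label.
aws emr create-cluster \
    --name "hadoop-r-example" \
    --release-label emr-5.36.0 \
    --applications Name=Hadoop \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --ec2-attributes KeyName=my-key-pair \
    --use-default-roles
```

The `--instance-count 3` flag requests one master node and two worker nodes; adding workers later is how we will scale the cluster in the performance experiments below.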
Note
Because Hadoop and its associated tools are mostly developed for Linux/Unix-based operating systems, the code in this chapter will probably not work on Windows. If you are a Windows user, follow...
Processing large datasets in batches using Hadoop
Batch processing is the most basic type of task that HDFS and MapReduce can perform. As with the data-parallel algorithms in Chapter 8, Multiplying Performance with Parallel Computing, the master node sends a set of instructions to the worker nodes, which execute those instructions on the blocks of data stored locally on each of them. The results are then written back to disk in HDFS.
When an aggregate result is required, both the map and reduce steps are performed on the data. For example, to compute the mean of a distributed dataset, the mappers on the worker nodes first compute the sum and the number of elements in each local chunk of data. The reducers then combine these partial results, dividing the total sum by the total count to obtain the global mean.
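The logic of this two-step computation can be sketched in plain R, without a Hadoop cluster. The chunks here stand in for HDFS blocks: each "mapper" emits a (sum, count) pair for its local chunk, and the "reducer" combines these pairs into the global mean:

```r
# Simulate computing a mean with a map step and a reduce step
set.seed(42)
x <- runif(1e6)

# Split the data into eight "blocks", as HDFS would distribute them
chunks <- split(x, rep(1:8, length.out = length(x)))

# Map step: each worker emits the sum and count of its local chunk
mapped <- lapply(chunks, function(chunk)
    c(sum = sum(chunk), count = length(chunk)))

# Reduce step: add up the partial sums and counts, then divide
totals <- Reduce(`+`, mapped)
global_mean <- unname(totals["sum"] / totals["count"])

# global_mean agrees with mean(x) computed on the full dataset
```

The key property exploited here is that the sum and count are both associative aggregates: partial results from each block can be combined in any order, which is exactly what makes the computation distributable.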
At other times, only the map step is performed, when aggregation is not required. This is common in data transformation or cleaning operations where the data is simply being transformed from one format to another. One example of this is extracting...
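A map-only transformation of this kind can be sketched in plain R. The log-style records below are invented for illustration; in a real MapReduce job, each worker would apply the same function to the records in its own HDFS blocks and write the reformatted output straight back to HDFS, with no reduce step:

```r
# Sketch of a map-only job: reformat records without any aggregation
records <- c("2024-01-01,alice,200",
             "2024-01-02,bob,404")

map_record <- function(line) {
    # Transform one CSV record into a key=value format
    fields <- strsplit(line, ",", fixed = TRUE)[[1]]
    sprintf("date=%s user=%s status=%s", fields[1], fields[2], fields[3])
}

transformed <- vapply(records, map_record, character(1), USE.NAMES = FALSE)
transformed
# [1] "date=2024-01-01 user=alice status=200"
# [2] "date=2024-01-02 user=bob status=404"
```

Because each record is processed independently, the job scales linearly with the number of workers: no communication between nodes is needed.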
In this chapter, we learned how to set up a Hadoop cluster on Amazon Elastic MapReduce, and how to use the RHadoop family of packages to analyze data in HDFS using MapReduce. We saw how the performance of a MapReduce task improves dramatically as more servers are added to the Hadoop cluster, but eventually reaches a limit due to Amdahl's law (Chapter 8, Multiplying Performance with Parallel Computing).
Hadoop and its ecosystem of tools are evolving rapidly. Other tools are being actively developed to make Hadoop perform even better. For example, Apache Spark (http://spark.apache.org/) provides Resilient Distributed Datasets (RDDs) that store data in memory across a Hadoop cluster. This allows data to be read from HDFS once and reused many times, dramatically improving the performance of interactive tasks like data exploration and of iterative algorithms like gradient descent or k-means clustering. Another example is Apache Storm (http://storm.incubator...