Chapter 10. R and Big Data
We have come to the final chapter of this book, where we push to the very limits of large-scale data processing. The term Big Data describes the ever-growing volume, velocity, and variety of data being generated on the Internet, by connected devices, and in many other places. Many organizations now have massive datasets measured in petabytes (one petabyte is 1,048,576 gigabytes), more than ever before. Processing and analyzing Big Data is extremely challenging for traditional data processing tools and database architectures.
In 2005, Doug Cutting and Mike Cafarella at Yahoo! developed Hadoop, based on earlier work by Google, to address these challenges. They set out to build a new data platform to process, index, and query billions of web pages efficiently. With Hadoop, work that would previously have required very expensive supercomputers can be done on large clusters of inexpensive standard servers. As the volume of data grows, more...
Before we learn how to use Hadoop (for more information refer to http://hadoop.apache.org/) and related tools in R, we need to understand the basics of Hadoop. For our purposes, it suffices to know that Hadoop comprises two key components: the Hadoop Distributed File System (HDFS) and the MapReduce framework to execute data processing tasks. Hadoop includes many other components for task scheduling, job management, and others, but we shall not concern ourselves with those in this book.
HDFS, as the name suggests, is a virtual filesystem that is distributed across a cluster of servers. HDFS stores files in blocks, with a default block size of 128 MB. For example, a 1 GB file is split into eight blocks of 128 MB, which are distributed to different servers in the cluster. Furthermore, to prevent data loss due to server failure, the blocks are replicated. By default, they are replicated three times—there are three copies of each block of data in the cluster, and each copy...
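The arithmetic behind this example can be illustrated in a few lines of plain R (no Hadoop required). The 128 MB block size and replication factor of 3 are the HDFS defaults described above:

```r
# Sketch: how HDFS splits and replicates a file, using the default settings
block_size_mb <- 128   # default HDFS block size
replication   <- 3     # default replication factor

file_size_mb <- 1024   # a 1 GB file, as in the example above

# Number of blocks the file is split into
num_blocks <- ceiling(file_size_mb / block_size_mb)

# Total space consumed across the cluster once every block is replicated
# (the last block of a file may be smaller than 128 MB; here 1 GB divides
# evenly, so all eight blocks are full)
total_storage_mb <- file_size_mb * replication

num_blocks        # 8
total_storage_mb  # 3072
```

Note that replication trades disk space for fault tolerance: a 1 GB file occupies 3 GB of cluster storage, but any single server can fail without losing data.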
Setting up Hadoop on Amazon Web Services
There are many ways to set up a Hadoop cluster. We can install Hadoop on a single server in pseudo-distributed mode to simulate a cluster, or on an actual cluster of servers or virtual machines in fully distributed mode. Several distributions of Hadoop are also available, from the vanilla open source version provided by the Apache Software Foundation to commercial distributions such as Cloudera, Hortonworks, and MapR. Covering all the different ways of setting up Hadoop is beyond the scope of this book. We instead provide instructions for one way to set up Hadoop and other relevant tools for the purposes of the examples in this chapter. If you are using an existing Hadoop cluster, or setting one up in a different way, you might have to modify some of the steps.
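As an illustration of what such a setup involves, a small cluster can be launched from the command line with the AWS CLI's `aws emr create-cluster` command. This is a configuration sketch only: the cluster name, key pair, instance type, and release label below are placeholders that you would replace with your own values (and check against the currently available EMR releases), and the command assumes the AWS CLI is installed and configured with your credentials.

```shell
# Sketch: launch a small Hadoop cluster on Amazon Elastic MapReduce.
# All names and sizes are placeholders -- substitute your own key pair,
# instance type, and a current EMR release label.
aws emr create-cluster \
    --name "hadoop-r-example" \
    --release-label emr-5.36.0 \
    --applications Name=Hadoop \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --ec2-attributes KeyName=my-key-pair \
    --use-default-roles
```

The `--instance-count 3` flag requests one master node and two worker nodes; adding workers later is how we will scale the cluster in the performance experiments below.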
Note
Because Hadoop and its associated tools are mostly developed for Linux/Unix-based operating systems, the code in this chapter will probably not work on Windows. If you are a Windows user, follow...
Processing large datasets in batches using Hadoop
Batch processing is the most basic type of task that HDFS and MapReduce can perform. As with the data-parallel algorithms in Chapter 8, Multiplying Performance with Parallel Computing, the master node sends a set of instructions to the worker nodes, which execute those instructions on the blocks of data stored locally on each of them. The results are then written back to disk in HDFS.
When an aggregate result is required, both the map and reduce steps are performed on the data. For example, to compute the mean of a distributed dataset, the mappers on the worker nodes first compute the sum and the number of elements in each local chunk of data. The reducers then combine these partial results, dividing the total sum by the total count to obtain the global mean.
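The logic of this two-step computation can be sketched in plain R, without a Hadoop cluster. The chunks here stand in for HDFS blocks: each "mapper" emits a (sum, count) pair for its local chunk, and the "reducer" combines these pairs into the global mean:

```r
# Simulate computing a mean with a map step and a reduce step
set.seed(42)
x <- runif(1e6)

# Split the data into eight "blocks", as HDFS would distribute them
chunks <- split(x, rep(1:8, length.out = length(x)))

# Map step: each worker emits the sum and count of its local chunk
mapped <- lapply(chunks, function(chunk)
    c(sum = sum(chunk), count = length(chunk)))

# Reduce step: add up the partial sums and counts, then divide
totals <- Reduce(`+`, mapped)
global_mean <- unname(totals["sum"] / totals["count"])

# global_mean agrees with mean(x) computed on the full dataset
```

The key property exploited here is that the sum and count are both associative aggregates: partial results from each block can be combined in any order, which is exactly what makes the computation distributable.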
At other times, only the map step is performed, when aggregation is not required. This is common in data transformation or cleaning operations where the data is simply being transformed from one format to another. One example of this is extracting...
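A map-only transformation of this kind can be sketched in plain R. The log-style records below are invented for illustration; in a real MapReduce job, each worker would apply the same function to the records in its own HDFS blocks and write the reformatted output straight back to HDFS, with no reduce step:

```r
# Sketch of a map-only job: reformat records without any aggregation
records <- c("2024-01-01,alice,200",
             "2024-01-02,bob,404")

map_record <- function(line) {
    # Transform one CSV record into a key=value format
    fields <- strsplit(line, ",", fixed = TRUE)[[1]]
    sprintf("date=%s user=%s status=%s", fields[1], fields[2], fields[3])
}

transformed <- vapply(records, map_record, character(1), USE.NAMES = FALSE)
transformed
# [1] "date=2024-01-01 user=alice status=200"
# [2] "date=2024-01-02 user=bob status=404"
```

Because each record is processed independently, the job scales linearly with the number of workers: no communication between nodes is needed.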
In this chapter, we learned how to set up a Hadoop cluster on Amazon Elastic MapReduce, and how to use the RHadoop family of packages to analyze data in HDFS using MapReduce. We saw how the performance of a MapReduce task improves dramatically as more servers are added to the Hadoop cluster, but eventually reaches a limit due to Amdahl's law (Chapter 8, Multiplying Performance with Parallel Computing).
Hadoop and its ecosystem of tools are evolving rapidly. Other tools are being actively developed to make Hadoop perform even better. For example, Apache Spark (http://spark.apache.org/) provides Resilient Distributed Datasets (RDDs) that store data in memory across a Hadoop cluster. This allows data to be read from HDFS once and reused many times, dramatically improving the performance of interactive tasks like data exploration and of iterative algorithms like gradient descent or k-means clustering. Another example is Apache Storm (http://storm.incubator...