R High Performance Programming

Product type: Book
Published: Jan 2015
Publisher: Packt Publishing
ISBN-13: 9781783989263
Pages: 176
Edition: 1st

Table of Contents

R High Performance Programming
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Preface
1. Understanding R's Performance – Why Are R Programs Sometimes Slow?
2. Profiling – Measuring Code's Performance
3. Simple Tweaks to Make R Run Faster
4. Using Compiled Code for Greater Speed
5. Using GPUs to Run R Even Faster
6. Simple Tweaks to Use Less RAM
7. Processing Large Datasets with Limited RAM
8. Multiplying Performance with Parallel Computing
9. Offloading Data Processing to Database Systems
10. R and Big Data
Index

Chapter 10. R and Big Data

We have come to the final chapter of this book, where we go to the very limits of large-scale data processing. The term Big Data describes the ever-growing volume, velocity, and variety of data being generated on the Internet, in connected devices, and in many other places. Many organizations now have massive datasets that measure in petabytes (one petabyte is 1,048,576 gigabytes). Processing and analyzing Big Data at this scale is extremely challenging for traditional data processing tools and database architectures.

In 2005, Doug Cutting and Mike Cafarella developed Hadoop, based on earlier work published by Google, to address these challenges. They set out to build a new data platform to process, index, and query billions of web pages efficiently. With Hadoop, work that would previously have required very expensive supercomputers can be done on large clusters of inexpensive commodity servers. As the volume of data grows, more...

Understanding Hadoop


Before we learn how to use Hadoop (for more information, refer to http://hadoop.apache.org/) and related tools in R, we need to understand the basics of Hadoop. For our purposes, it suffices to know that Hadoop comprises two key components: the Hadoop Distributed File System (HDFS), which stores data across a cluster, and the MapReduce framework, which executes data processing tasks. Hadoop includes many other components for task scheduling, job management, and other functions, but we shall not concern ourselves with those in this book.

HDFS, as the name suggests, is a virtual filesystem that is distributed across a cluster of servers. HDFS stores files in blocks, with a default block size of 128 MB. For example, a 1 GB file is split into eight blocks of 128 MB each, which are distributed to different servers in the cluster. Furthermore, to prevent data loss when a server fails, the blocks are replicated. By default, each block is replicated three times, so there are three copies of each block of data in the cluster, and each copy...
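
To make this concrete, here is a minimal sketch of moving a file into HDFS from R using the rhdfs package (part of the RHadoop family used later in this chapter). The file names and paths are hypothetical, and it assumes rhdfs is installed and that HADOOP_CMD points to your hadoop executable:

# Minimal rhdfs sketch (hypothetical paths; assumes rhdfs is installed
# and HADOOP_CMD points to the hadoop executable).
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")  # adjust for your cluster
library(rhdfs)
hdfs.init()

# Copy a local file into HDFS; Hadoop splits it into blocks and
# replicates each block across the cluster automatically.
hdfs.put("sales.csv", "/user/analyst/sales.csv")

# List the directory to confirm the upload.
hdfs.ls("/user/analyst")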

Setting up Hadoop on Amazon Web Services


There are many ways to set up a Hadoop cluster. We can install Hadoop on a single server in pseudo-distributed mode to simulate a cluster, or on an actual cluster of servers or virtual machines in fully distributed mode. There are also several distributions of Hadoop available, from the vanilla open source version provided by the Apache Foundation to commercial distributions such as Cloudera, Hortonworks, and MapR. Covering all the different ways of setting up Hadoop is beyond the scope of this book, so we instead provide instructions for one way to set up Hadoop and the other tools needed for the examples in this chapter. If you are using an existing Hadoop cluster or setting one up in a different way, you might have to modify some of the steps.

Note

Because Hadoop and its associated tools are mostly developed for Linux/Unix-based operating systems, the code in this chapter will probably not work on Windows. If you are a Windows user, follow...

Processing large datasets in batches using Hadoop


Batch processing is the most basic type of task that HDFS and MapReduce can perform. As with the data-parallel algorithms in Chapter 8, Multiplying Performance with Parallel Computing, the master node sends a set of instructions to the worker nodes, which execute those instructions on the blocks of data stored locally on them. The results are then written back to disk in HDFS.

When an aggregate result is required, both the map and reduce steps are performed on the data. For example, to compute the mean of a distributed dataset, the mappers on the worker nodes first compute the sum and the number of elements in each local chunk of data. The reducers then combine these partial sums and counts to compute the global mean.
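
As a concrete sketch of this pattern (our own illustration, not a listing from the book; it assumes the rmr2 package from the RHadoop family is installed and that Hadoop streaming is configured via HADOOP_CMD and HADOOP_STREAMING), the distributed mean can be written as a single mapreduce() call:

# A minimal rmr2 sketch of the distributed mean described above.
library(rmr2)

# Write a sample numeric vector to HDFS; rmr2 splits it into chunks.
input <- to.dfs(rnorm(1e6))

result <- mapreduce(
  input = input,
  # Map: each worker emits the sum and count of its local chunk.
  map = function(k, v) {
    keyval(1, data.frame(sum = sum(v), n = length(v)))
  },
  # Reduce: combine the partial sums and counts into the global mean.
  reduce = function(k, partials) {
    keyval(k, sum(partials$sum) / sum(partials$n))
  }
)

# Read the result back from HDFS.
values(from.dfs(result))

Note that the mappers emit (sum, count) pairs rather than local means: averaging per-chunk means directly would weight unequally sized chunks incorrectly.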

At other times, when aggregation is not required, only the map step is performed. This is common in data transformation or cleaning operations where the data is simply being transformed from one format to another. One example of this is extracting...
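
For illustration (a hypothetical transformation of our own, since the chapter's example is truncated here), a map-only job in rmr2 simply omits the reduce argument; each mapper writes its transformed chunk straight back to HDFS:

# Hypothetical map-only rmr2 job: convert temperatures from Celsius
# to Fahrenheit with no aggregation step.
library(rmr2)

celsius <- to.dfs(runif(1000, min = -10, max = 40))

fahrenheit <- mapreduce(
  input = celsius,
  # No reduce argument: the mappers' output is the final output.
  map = function(k, v) keyval(k, v * 9 / 5 + 32)
)

head(values(from.dfs(fahrenheit)))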

Summary


In this chapter, we learned how to set up a Hadoop cluster on Amazon Elastic MapReduce, and how to use the RHadoop family of packages to analyze data in HDFS using MapReduce. We saw how the performance of a MapReduce task improves dramatically as more servers are added to the Hadoop cluster, but eventually reaches a limit due to Amdahl's law (Chapter 8, Multiplying Performance with Parallel Computing).
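
To recall why this limit exists (a standard statement of Amdahl's law, not a formula from this chapter): if a fraction $p$ of a job can be parallelized across $n$ servers while the remaining $1 - p$ must run serially, the best achievable speedup is

$$\mathrm{speedup}(n) = \frac{1}{(1 - p) + p/n}, \qquad \lim_{n \to \infty} \mathrm{speedup}(n) = \frac{1}{1 - p}$$

So even if 95 percent of a MapReduce job parallelizes perfectly, no number of servers can push the speedup beyond $1/0.05 = 20$ times.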

Hadoop and its ecosystem of tools are rapidly evolving, and other tools are being actively developed to make Hadoop perform even better. For example, Apache Spark (http://spark.apache.org/) provides Resilient Distributed Datasets (RDDs) that store data in memory across a Hadoop cluster. This allows data to be read from HDFS once and used many times, dramatically improving the performance of interactive tasks like data exploration and of iterative algorithms like gradient descent or k-means clustering. Another example is Apache Storm (http://storm.incubator...
