You're reading from Big Data Analytics with R

Product type: Book
Published in: Jul 2016
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781786466457
Edition: 1st Edition
Author: Simon Walkowiak

Simon Walkowiak is a cognitive neuroscientist and a managing director of Mind Project Ltd, a Big Data and Predictive Analytics consultancy based in London, United Kingdom. As a former data curator at the UK Data Service (UKDS, University of Essex), Europe's largest socio-economic data repository, Simon has extensive experience in processing and managing large-scale datasets such as censuses, sensor and smart meter data, telecommunication data, and well-known governmental and social surveys such as the British Social Attitudes survey, Labour Force surveys, Understanding Society, National Travel survey, and many other socio-economic datasets collected and deposited by Eurostat, World Bank, Office for National Statistics, Department for Transport, NatCen, and International Energy Agency, to mention just a few. Simon has delivered numerous data science and R training courses at public institutions and international companies. He has also taught a course in Big Data Methods in R at major UK universities and at the prestigious Big Data and Analytics Summer School organized by the Institute of Analytics and Data Science (IADS).

Chapter 7. Faster than Hadoop - Spark with R

In Chapter 4, Hadoop and MapReduce Framework for R, you learned about the Hadoop and MapReduce frameworks, which enable users to process and analyze massive datasets stored in the Hadoop Distributed File System (HDFS). We launched a multi-node Hadoop cluster to run heavy data crunching jobs in R that would not otherwise be achievable on an average personal computer with any of the R distributions installed. We also noted that although Hadoop is extremely powerful, its processing is rather slow, so it is generally recommended only for data that greatly exceeds the available memory. In this chapter we present the Apache Spark engine, a faster way to process and analyze Big Data. After reading this chapter, you should be able to:

  • Understand Spark's main characteristics and functionality

  • Deploy a fully operational, multi-node Microsoft Azure HDInsight cluster with Hadoop, Spark, and Hive fully configured and ready to use

  • Import...

Spark for Big Data analytics


Spark is often considered a new, faster, and more advanced engine for Big Data analytics that could soon overtake Hadoop as the most widely used Big Data tool. In fact, many businesses already opt for Spark rather than Hadoop in their daily data processing activities. Spark has several selling points that make it an attractive alternative to the somewhat complicated and sometimes clunky Hadoop:

  • It is fast: compared with standard Hadoop MapReduce jobs, it can cut processing time by a factor of up to 100 when run in memory, or up to 10 when run on disk.

  • It is a flexible tool that can run as a standalone application, but can also be deployed on top of Hadoop and HDFS or other distributed file systems.

  • It can use a variety of data sources, from standard relational databases through HBase and Hive to Amazon S3 buckets. It can also be launched in the cloud. In fact, in the tutorial in this chapter...

Spark with R on a multi-node HDInsight cluster


Although Spark can be deployed in single-node, standalone mode, its capabilities are best suited to multi-node applications. With this in mind, we will dedicate most of this chapter to practical Big Data crunching with Spark and R on a Microsoft Azure HDInsight cluster. As you should already be familiar with the deployment process of HDInsight clusters, our Spark workflows will contain one additional twist: the Spark framework will process the data straight from the Hive database, which will be populated with tables from HDFS. The introduction of Hive is a useful extension of the concepts covered in Chapter 5, R with Relational Database Management Systems (RDBMSs), and Chapter 6, R with Non-Relational (NoSQL) Databases, where we discussed the connectivity of R with relational and non-relational databases. But before we can use it, we first need to launch a new HDInsight cluster with Spark and RStudio.

Launching HDInsight with Spark and...

Summary


In this chapter we introduced you to the Apache Spark engine for fast Big Data processing. We explained how to launch a multi-node HDInsight cluster with Hadoop, Spark, and the Hive database installed and how to connect all these resources to RStudio Server.

We then used the Bay Area Bike Share open data to guide you through numerous functions of the SparkR package for managing, transforming, and analyzing data stored in Hive tables, all directly from the R console.

In Chapter 8, Machine Learning Methods for Big Data in R, we will explore another powerful dimension of Big Data analytics with the R language: we will apply a variety of predictive analytics algorithms to large-scale datasets using the H2O platform for distributed machine learning.
