Chapter 10. Working with SparkR

In this chapter, we'll cover the following recipes:

  • Introduction

  • Installing R

  • Interactive analysis with the SparkR shell

  • Creating a SparkR standalone application from RStudio

  • Creating SparkR DataFrames

  • SparkR DataFrame operations

  • Applying user-defined functions in SparkR

  • Running SQL queries from SparkR and caching DataFrames

  • Machine learning with SparkR

Introduction


R is a flexible, powerful, open source statistical programming language. It is preferred by many professional statisticians and researchers in a variety of fields and has extensive statistical and graphical capabilities. R combines aspects of functional and object-oriented programming. One of its key features is implicit looping, which yields compact, simple code and frequently leads to faster execution. It provides an interpreted, command-line statistical computing environment with a built-in scripting language.

R is an integrated suite of software facilities for data manipulation, calculation, and graphical display. Its key strengths are an effective data handling and storage facility and a rich collection of tools for data analysis, and it offers a number of extensions that support data processing and machine learning tasks. However, interactive analysis in R is limited, as the runtime is single-threaded and can only process datasets that fit in a single machine's memory.

The...

Installing R


In this recipe, we will see how to install R on Linux.

Getting ready…

To step through this recipe, you need Ubuntu 14.04 (Linux flavor) installed on the machine.

How to do it…

Here are the steps in the installation of R:

  1. The Comprehensive R Archive Network (CRAN) contains precompiled binary distributions of the base system and contributed packages. It also contains source code for all the platforms. Add the security key as follows:

    sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
    
  2. Add the CRAN repository to the end of /etc/apt/sources.list:

       deb https://cran.cnr.berkeley.edu/bin/linux/ubuntu trusty/
    
  3. Install R as follows:

      sudo apt-get update
      sudo apt-get install r-base r-base-dev
    

This will install R and the recommended packages, and additional packages can be installed using install.packages("<package>"). The packages on CRAN are updated on a regular basis and the most recent versions will usually be available within a couple...
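For instance, once the installation completes, a contributed package can be added from within an R session; the package used here (data.table) is only an illustrative choice:

      # Install a contributed CRAN package from within R
      # (data.table is only an example package)
      install.packages("data.table")
      library(data.table)
      packageVersion("data.table")   # confirm the installed version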

Interactive analysis with the SparkR shell


The entry point into SparkR is the SparkContext which connects the R program to a Spark Cluster. When working with the SparkR shell, SQLContext and SparkContext are already available. SparkR's shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively.

Getting ready

To step through this recipe, you will need a running Spark Cluster either in pseudo distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos.

How to do it…

In this recipe, we'll see how to start the SparkR interactive shell using Spark 1.6.0:

  1. Start the SparkR shell by running the following from the Spark installation directory:

      /bigdata/spark-1.6.0-bin-hadoop2.6$ ./bin/sparkR --master spark://192.168.0.118:7077
      R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
      Copyright (C) 2015 The R Foundation for Statistical Computing
      Platform: x86_64-pc-linux-gnu (64-bit)
      R is free software and comes with...
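Once the prompt appears, sc and sqlContext are already available, so data can be explored immediately. The following short session is a sketch rather than part of the original output; it distributes R's built-in faithful dataset and inspects it with the Spark 1.6.0 API:

      df <- createDataFrame(sqlContext, faithful)   # distribute a local R data frame
      head(df)          # first few rows, collected back to the driver
      printSchema(df)   # column names and types
      count(df)         # number of rows in the distributed DataFrame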

Creating a SparkR standalone application from RStudio


In this recipe, we'll look at the process of writing and executing a standalone application in SparkR.

Getting ready

To step through this recipe, you will need a running Spark Cluster either in pseudo distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos. Also, install RStudio. Please refer to the Installing R recipe for details on the installation of R.

How to do it…

In this recipe, we'll create a standalone application using Spark 1.6.0 and Spark 2.0.2:

  1. Before working with SparkR, make sure that SPARK_HOME is set in the environment as follows:

       if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
         Sys.setenv(SPARK_HOME = "/home/padmac/bigdata/spark-1.6.0-bin-hadoop2.6")
       }
    
  2. Now, load the SparkR package and invoke sparkR.init as follows:

      library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
      sc <- sparkR.init(master = "spark://192.168.0.118:7077",
       ...
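The sparkR.init call above is cut off; the following is a minimal end-to-end sketch of what such a standalone script might look like with the Spark 1.6.0 API, assuming the same master URL and SPARK_HOME as in the snippet (the appName is an arbitrary choice):

      if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
        Sys.setenv(SPARK_HOME = "/home/padmac/bigdata/spark-1.6.0-bin-hadoop2.6")
      }
      library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

      # Initialize the Spark context and the SQL context (Spark 1.6 API)
      sc <- sparkR.init(master = "spark://192.168.0.118:7077", appName = "SparkRStandalone")
      sqlContext <- sparkRSQL.init(sc)

      # A small distributed computation as a smoke test
      df <- createDataFrame(sqlContext, faithful)
      print(count(df))

      sparkR.stop()

Such a script can then be sourced from RStudio or submitted to the cluster with spark-submit.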

Creating SparkR DataFrames


A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a DataFrame in R, but with rich optimizations. SparkR DataFrames scale to large datasets using the support for distributed computation in Spark. In this recipe, we'll see how to create SparkR DataFrames from different sources, such as JSON, CSV, local R DataFrames, and Hive tables.

Getting ready

To step through this recipe, you will need a running Spark Cluster either in pseudo distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos. Also, install RStudio. Please refer to the Installing R recipe for details on the installation of R. Please refer to the Creating a SparkR standalone application from Rstudio recipe for details on working with the SparkR package.

How to do it…

In this recipe, we'll see how to create SparkR DataFrames in Spark 1.6.0 as well as Spark 2.0.2:

  1. Use createDataFrame...
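The step above is truncated; as a sketch of the sources mentioned in the introduction, the following Spark 1.6.0 calls create DataFrames from a local R data frame and from a JSON file (the file path is illustrative, and sc/sqlContext are assumed to be initialized as in the previous recipe):

      # From a local R data frame
      df <- createDataFrame(sqlContext, faithful)
      head(df)

      # From a JSON file (the path is only an example)
      people <- read.df(sqlContext, "/home/padmac/data/people.json", source = "json")
      printSchema(people)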

SparkR DataFrame operations


SparkR DataFrames support a number of operations to do structured data processing. In this recipe, we'll see a good number of examples, such as selection, grouping, aggregation, and so on.

Getting ready

To step through this recipe, you will need a running Spark Cluster either in pseudo distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos. Also, install RStudio. Please refer to the Installing R recipe for details on the installation of R and the Creating SparkR DataFrames recipe to get acquainted with the creation of DataFrames from a variety of data sources.

How to do it…

In this recipe, we'll see how to perform various operations on SparkR DataFrames:

  1. Let's see how to select a column from a DataFrame:

      library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
      sc <- sparkR.init(master = "local[*]", sparkEnvir = list(spark.driver.memory = "2g"))
      sqlContext <- sparkRSQL.init(sc)...
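The listing above is cut off; continuing in the same session, the following sketch illustrates the kinds of operations this recipe covers (selection, filtering, grouping, and aggregation) on a DataFrame built from the faithful dataset:

      df <- createDataFrame(sqlContext, faithful)

      # Select a single column
      head(select(df, df$eruptions))

      # Filter rows by a predicate
      head(filter(df, df$waiting < 50))

      # Group by a column and count the rows in each group
      waiting_counts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))
      head(waiting_counts)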

Applying user-defined functions in SparkR


In this recipe, we'll see how to apply functions such as dapply, gapply, and lapply over a Spark DataFrame.

Getting ready

To step through this recipe, you will need a running Spark Cluster either in pseudo distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos. Also, install RStudio. Please refer to the Installing R recipe for details on the installation of R and the Creating SparkR DataFrames recipe to get acquainted with the creation of DataFrames from a variety of data sources.

How to do it…

In this recipe, we'll see how to apply the user-defined functions available as of Spark 2.0.2:

  1. Here is the code that applies dapply to a Spark DataFrame:

          schema <- structType(structField("eruptions", "double"),
                               structField("waiting", "double"),
                               structField("waiting_secs", "double"))
          df1 <- dapply(df, function(x) { x <- cbind(x, x$waiting * 60) }, schema)
         ...
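The listing is truncated here; as a companion sketch, the following applies gapply (also available as of Spark 2.0.2) to the same DataFrame, which, judging by its columns, holds the faithful data. It groups by waiting and computes the maximum eruption time per group:

          schema2 <- structType(structField("waiting", "double"),
                                structField("max_eruption", "double"))
          result <- gapply(df, "waiting",
                           function(key, x) {
                             # key is the grouping value; x is a local R data.frame
                             data.frame(key, max(x$eruptions))
                           },
                           schema2)
          head(collect(arrange(result, desc(result$max_eruption))))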

Running SQL queries from SparkR and caching DataFrames


In this recipe, we'll see how to run SQL queries over SparkR DataFrames and cache the datasets.

Getting ready

To step through this recipe, you will need a running Spark Cluster either in pseudo distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos. Also, install RStudio. Please refer to the Installing R recipe for details on the installation of R and the Creating SparkR DataFrames recipe to get acquainted with the creation of DataFrames from a variety of data sources.

How to do it…

The following code shows how to run SQL queries over SparkR DataFrames using Spark 1.6.0. As of Spark 2.0.2, the methods remain the same, except that a SparkSession is used instead of the SQLContext:

  1. Let's create a DataFrame from a JSON file. The sample JSON file people.json contains the following content:

      {"name":"Michael"}
      {"name":"Andy", "age":30}
      {"name":"Justin", "age":19}

    Here is the code snippet for creating a data...
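Since the snippet is cut off, here is a sketch of the remaining flow with the Spark 1.6.0 API: load people.json, register it as a temporary table, query it with SQL, and cache the DataFrame (the file path is illustrative):

      people <- read.df(sqlContext, "/home/padmac/data/people.json", source = "json")

      # Register the DataFrame as a temporary table and query it with SQL
      registerTempTable(people, "people")
      teenagers <- sql(sqlContext, "SELECT name FROM people WHERE age >= 13 AND age <= 19")
      head(teenagers)

      # Cache the DataFrame so repeated queries avoid re-reading the JSON file
      cache(people)
      count(people)   # an action that materializes the cache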

Machine learning with SparkR


SparkR is integrated with Spark's MLlib machine learning library so that algorithms can be parallelized seamlessly, without manually specifying which parts of the algorithm can run in parallel. MLlib is one of the fastest-growing machine learning libraries; hence, the ability to use R with MLlib will bring a huge number of contributions to MLlib from R users. As of Spark 1.6, generalized linear models (with Gaussian and binomial families) are supported over DataFrames, and as of Spark 2.0.2, additional algorithms such as Naive Bayes and k-means are available.

Getting ready

To step through this recipe, you will need a running Spark Cluster either in pseudo distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos. Also, install RStudio. Please refer to the Installing R recipe for details on the installation of R and the Creating SparkR DataFrames recipe to get acquainted with the creation of DataFrames from a variety of data sources.

How to do it…

Here...
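The rest of the recipe is not shown; as an illustration of the Spark 1.6 support described above, the following sketch fits a Gaussian generalized linear model on the faithful DataFrame and inspects the result:

      df <- createDataFrame(sqlContext, faithful)

      # Fit a Gaussian GLM: predict waiting time from eruption length
      model <- glm(waiting ~ eruptions, data = df, family = "gaussian")

      # Coefficients and fit statistics, similar to base R's glm summary
      summary(model)

      # Predictions come back as a SparkR DataFrame
      predictions <- predict(model, df)
      head(select(predictions, c("waiting", "prediction")))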
