Chapter 10. Working with SparkR

In this chapter, we'll cover the following recipes:

  • Introduction

  • Installing R

  • Interactive analysis with the SparkR shell

  • Creating a SparkR standalone application from RStudio

  • Creating SparkR DataFrames

  • SparkR DataFrame operations

  • Applying user-defined functions in SparkR

  • Running SQL queries from SparkR and caching DataFrames

  • Machine learning with SparkR

Introduction


R is a flexible, powerful, open source statistical programming language. It is preferred by many professional statisticians and researchers in a variety of fields and has extensive statistical and graphical capabilities. R combines aspects of functional and object-oriented programming. One of its key features is implicit looping, which yields compact, simple code and frequently leads to faster execution. It provides an interpreted, command-line statistical computing environment with a built-in scripting language.

R is an integrated suite of software facilities for data manipulation, calculation, and graphical display. Its key strengths are an effective data handling and storage facility and a rich collection of tools for data analysis, and it offers a number of extensions that support data processing and machine learning tasks. However, interactive analysis in R is limited, as the runtime is single-threaded and can only process datasets that fit in a single machine's memory.

The...

Installing R


In this recipe, we will see how to install R on Linux.

Getting ready…

To step through this recipe, you need Ubuntu 14.04 (Linux flavor) installed on the machine.

How to do it…

Here are the steps in the installation of R:

  1. The Comprehensive R Archive Network (CRAN) contains precompiled binary distributions of the base system and contributed packages. It also contains source code for all the platforms. Add the security key as follows:

    sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
    
  2. Add the CRAN repository to the end of /etc/apt/sources.list:

       deb https://cran.cnr.berkeley.edu/bin/linux/ubuntu trusty/
    
  3. Install R as follows:

      sudo apt-get update
      sudo apt-get install r-base r-base-dev
    

This will install R and the recommended packages, and additional packages can be installed using install.packages("<package>"). The packages on CRAN are updated on a regular basis and the most recent versions will usually be available within a couple...
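For instance, once the installation completes, a contributed package can be added from within an R session; the package used here (data.table) is only an illustrative choice:

      # Install a contributed CRAN package from within R
      # (data.table is only an example package)
      install.packages("data.table")
      library(data.table)
      packageVersion("data.table")   # confirm the installed version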

Interactive analysis with the SparkR shell


The entry point into SparkR is the SparkContext which connects the R program to a Spark Cluster. When working with the SparkR shell, SQLContext and SparkContext are already available. SparkR's shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively.

Getting ready

To step through this recipe, you will need a running Spark Cluster either in pseudo distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos.

How to do it…

In this recipe, we'll see how to start the SparkR interactive shell using Spark 1.6.0:

  1. Start the SparkR shell by running the following from the Spark installation directory:

      /bigdata/spark-1.6.0-bin-hadoop2.6$ ./bin/sparkR --master spark://192.168.0.118:7077
      R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
      Copyright (C) 2015 The R Foundation for Statistical Computing
      Platform: x86_64-pc-linux-gnu (64-bit)
      R is free software and comes with...
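Once the prompt appears, sc and sqlContext are already available, so data can be explored immediately. The following short session is a sketch rather than part of the original output; it distributes R's built-in faithful dataset and inspects it with the Spark 1.6.0 API:

      df <- createDataFrame(sqlContext, faithful)   # distribute a local R data frame
      head(df)          # first few rows, collected back to the driver
      printSchema(df)   # column names and types
      count(df)         # number of rows in the distributed DataFrame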

Creating a SparkR standalone application from RStudio


In this recipe, we'll look at the process of writing and executing a standalone application in SparkR.

Getting ready

To step through this recipe, you will need a running Spark Cluster either in pseudo distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos. Also, install RStudio. Please refer to the Installing R recipe for details on the installation of R.

How to do it…

In this recipe, we'll create a standalone application using Spark 1.6.0 and Spark 2.0.2:

  1. Before working with SparkR, make sure that SPARK_HOME is set in the environment as follows:

       if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
         Sys.setenv(SPARK_HOME = "/home/padmac/bigdata/spark-1.6.0-bin-hadoop2.6")
       }
    
  2. Now, load the SparkR package and invoke sparkR.init as follows:

      library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
      sc <- sparkR.init(master = "spark://192.168.0.118:7077",
       ...
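The sparkR.init call above is cut off; the following is a minimal end-to-end sketch of what such a standalone script might look like with the Spark 1.6.0 API, assuming the same master URL and SPARK_HOME as in the snippet (the appName is an arbitrary choice):

      if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
        Sys.setenv(SPARK_HOME = "/home/padmac/bigdata/spark-1.6.0-bin-hadoop2.6")
      }
      library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

      # Initialize the Spark context and the SQL context (Spark 1.6 API)
      sc <- sparkR.init(master = "spark://192.168.0.118:7077", appName = "SparkRStandalone")
      sqlContext <- sparkRSQL.init(sc)

      # A small distributed computation as a smoke test
      df <- createDataFrame(sqlContext, faithful)
      print(count(df))

      sparkR.stop()

Such a script can then be sourced from RStudio or submitted to the cluster with spark-submit.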

Creating SparkR DataFrames


A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a DataFrame in R, but with rich optimizations. SparkR DataFrames scale to large datasets using the support for distributed computation in Spark. In this recipe, we'll see how to create SparkR DataFrames from different sources, such as JSON, CSV, local R DataFrames, and Hive tables.

Getting ready

To step through this recipe, you will need a running Spark Cluster either in pseudo distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos. Also, install RStudio. Please refer to the Installing R recipe for details on the installation of R. Please refer to the Creating a SparkR standalone application from Rstudio recipe for details on working with the SparkR package.

How to do it…

In this recipe, we'll see how to create SparkR DataFrames in Spark 1.6.0 as well as Spark 2.0.2:

  1. Use createDataFrame...
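The step above is truncated; as a sketch of the sources mentioned in the introduction, the following Spark 1.6.0 calls create DataFrames from a local R data frame and from a JSON file (the file path is illustrative, and sc/sqlContext are assumed to be initialized as in the previous recipe):

      # From a local R data frame
      df <- createDataFrame(sqlContext, faithful)
      head(df)

      # From a JSON file (the path is only an example)
      people <- read.df(sqlContext, "/home/padmac/data/people.json", source = "json")
      printSchema(people)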

SparkR DataFrame operations


SparkR DataFrames support a number of operations to do structured data processing. In this recipe, we'll see a good number of examples, such as selection, grouping, aggregation, and so on.

Getting ready

To step through this recipe, you will need a running Spark Cluster either in pseudo distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos. Also, install RStudio. Please refer to the Installing R recipe for details on the installation of R and the Creating SparkR DataFrames recipe to get acquainted with the creation of DataFrames from a variety of data sources.

How to do it…

In this recipe, we'll see how to perform various operations on SparkR DataFrames:

  1. Let's see how to select a column from a DataFrame:

      library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
      sc <- sparkR.init(master = "local[*]", sparkEnvir = list(spark.driver.memory = "2g"))
      sqlContext <- sparkRSQL.init(sc)...
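The listing above is cut off; continuing in the same session, the following sketch illustrates the kinds of operations this recipe covers (selection, filtering, grouping, and aggregation) on a DataFrame built from the faithful dataset:

      df <- createDataFrame(sqlContext, faithful)

      # Select a single column
      head(select(df, df$eruptions))

      # Filter rows by a predicate
      head(filter(df, df$waiting < 50))

      # Group by a column and count the rows in each group
      waiting_counts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))
      head(waiting_counts)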

Applying user-defined functions in SparkR


In this recipe, we'll see how to apply functions such as dapply, gapply, and lapply over a Spark DataFrame.

Getting ready

To step through this recipe, you will need a running Spark Cluster either in pseudo distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos. Also, install RStudio. Please refer to the Installing R recipe for details on the installation of R and the Creating SparkR DataFrames recipe to get acquainted with the creation of DataFrames from a variety of data sources.

How to do it…

In this recipe, we'll see how to apply the user-defined functions available as of Spark 2.0.2:

  1. Here is the code that applies dapply to a Spark DataFrame:

          schema <- structType(structField("eruptions", "double"),
                               structField("waiting", "double"),
                               structField("waiting_secs", "double"))
          df1 <- dapply(df, function(x) { x <- cbind(x, x$waiting * 60) }, schema)
         ...
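The listing is truncated here; as a companion sketch, the following applies gapply (also available as of Spark 2.0.2) to the same DataFrame, which, judging by its columns, holds the faithful data. It groups by waiting and computes the maximum eruption time per group:

          schema2 <- structType(structField("waiting", "double"),
                                structField("max_eruption", "double"))
          result <- gapply(df, "waiting",
                           function(key, x) {
                             # key is the grouping value; x is a local R data.frame
                             data.frame(key, max(x$eruptions))
                           },
                           schema2)
          head(collect(arrange(result, desc(result$max_eruption))))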

Running SQL queries from SparkR and caching DataFrames


In this recipe, we'll see how to run SQL queries over SparkR DataFrames and cache the datasets.

Getting ready

To step through this recipe, you will need a running Spark Cluster either in pseudo distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos. Also, install RStudio. Please refer to the Installing R recipe for details on the installation of R and the Creating SparkR DataFrames recipe to get acquainted with the creation of DataFrames from a variety of data sources.

How to do it…

The following code shows how to run SQL queries over SparkR DataFrames using Spark 1.6.0. As of Spark 2.0.2, the methods remain the same, except that a SparkSession is used instead of the SQLContext:

  1. Let's create a DataFrame from a JSON file. The sample JSON file people.json contains the following content:

      {"name":"Michael"}
      {"name":"Andy", "age":30}
      {"name":"Justin", "age":19}

    Here is the code snippet for creating a data...
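Since the snippet is cut off, here is a sketch of the remaining flow with the Spark 1.6.0 API: load people.json, register it as a temporary table, query it with SQL, and cache the DataFrame (the file path is illustrative):

      people <- read.df(sqlContext, "/home/padmac/data/people.json", source = "json")

      # Register the DataFrame as a temporary table and query it with SQL
      registerTempTable(people, "people")
      teenagers <- sql(sqlContext, "SELECT name FROM people WHERE age >= 13 AND age <= 19")
      head(teenagers)

      # Cache the DataFrame so repeated queries avoid re-reading the JSON file
      cache(people)
      count(people)   # an action that materializes the cache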

Machine learning with SparkR


SparkR is integrated with Spark's MLlib machine learning library so that algorithms can be parallelized seamlessly, without manually specifying which parts of the algorithm can run in parallel. MLlib is one of the fastest-growing machine learning libraries; hence, the ability to use R with MLlib will bring a huge number of contributions to MLlib from R users. As of Spark 1.6, generalized linear models (with Gaussian and binomial families) are supported over DataFrames, and as of Spark 2.0.2, additional algorithms such as Naive Bayes and k-means are available.

Getting ready

To step through this recipe, you will need a running Spark Cluster either in pseudo distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos. Also, install RStudio. Please refer to the Installing R recipe for details on the installation of R and the Creating SparkR DataFrames recipe to get acquainted with the creation of DataFrames from a variety of data sources.

How to do it…

Here...
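The rest of the recipe is not shown; as an illustration of the Spark 1.6 support described above, the following sketch fits a Gaussian generalized linear model on the faithful DataFrame and inspects the result:

      df <- createDataFrame(sqlContext, faithful)

      # Fit a Gaussian GLM: predict waiting time from eruption length
      model <- glm(waiting ~ eruptions, data = df, family = "gaussian")

      # Coefficients and fit statistics, similar to base R's glm summary
      summary(model)

      # Predictions come back as a SparkR DataFrame
      predictions <- predict(model, df)
      head(select(predictions, c("waiting", "prediction")))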
