Apache Spark cluster manager types


As discussed previously, Apache Spark currently supports three cluster managers:

  • Standalone cluster manager
  • Apache Mesos
  • Hadoop YARN

We'll look at setting these up in much more detail in Chapter 8, Operating in Clustered Mode.

Building standalone applications with Apache Spark

Until now we have used Spark for exploratory analysis, using the Scala and Python shells. Spark can also be used in standalone applications written in Java, Scala, Python, or R. As we saw earlier, the Spark shell and PySpark provide you with a SparkContext. However, when you write your own application, you need to initialize your own SparkContext. Once you have a SparkContext reference, the rest of the API is exactly the same as for interactive query analysis. After all, it's the same object, just running in a different context.
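As an example, a minimal standalone Scala application might create its SparkContext along these lines; the application name, master URL, and input path are placeholders used here purely for illustration:

import org.apache.spark.{SparkConf, SparkContext}

object WordCountApp {
  def main(args: Array[String]): Unit = {
    // The shells create a SparkContext for you; a standalone application
    // has to configure and create its own.
    val conf = new SparkConf()
      .setAppName("WordCountApp")   // hypothetical application name
      .setMaster("local[*]")        // assumption: run locally on all cores
    val sc = new SparkContext(conf)

    // From here on, the API is identical to what the interactive shells offer.
    val counts = sc.textFile("input.txt")   // hypothetical input file
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()
  }
}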

The exact method of using Spark in your application depends on your language of choice. All Spark artifacts are hosted in Maven Central. You can add a Maven dependency with the following coordinates:

groupId: org.apache.spark
artifactId: spark-core_2.10
version: 1.6.1

You can use Maven to build the project, or alternatively use the Scala/Eclipse IDE to add a Maven dependency to your project.
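If you prefer SBT (mentioned below) over Maven, a roughly equivalent declaration in build.sbt would be the following; the %% operator appends your project's Scala binary version, here 2.10, to the artifact name:

// build.sbt -- equivalent of the Maven coordinates above
scalaVersion := "2.10.6"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"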

Note

Apache Maven is a build automation tool used primarily for Java projects. The word maven means "accumulator of knowledge" in Yiddish. Maven addresses the two core aspects of building software: first, it describes how the software is built and second, it describes its dependencies.

You can configure your IDE to work with Spark. While many Spark developers use SBT or Maven on the command line, the most commonly used IDE is IntelliJ IDEA. The Community Edition is free, and you can then install the JetBrains Scala plugin. You can find detailed instructions on setting up either IntelliJ IDEA or Eclipse to build Spark at http://bit.ly/28RDPFy.

Submitting applications

The spark-submit script in Spark's bin directory is the most commonly used method of submitting Spark applications to a cluster, and it can be used to launch applications on all supported cluster types. You need to package your application together with its dependencies so that Spark can distribute them across the cluster; in practice, this means creating an assembly JAR (also known as an uber or fat JAR) containing your code and all of its relevant dependencies.

A Spark application with its dependencies can be launched using the bin/spark-submit script. This script takes care of setting up the classpath and its dependencies, and it supports all the cluster managers and deploy modes supported by Spark.

Figure 1.16: Spark submission template
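In its general form, a spark-submit invocation looks roughly like the following; the angle-bracketed values are placeholders you fill in for your application and cluster:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  <application-jar> \
  [application-arguments]

Here, --class names your application's entry point (for JVM languages), --master takes a URL such as spark://host:7077 (standalone), mesos://host:5050, yarn, or local[*], --deploy-mode is either client or cluster, and <application-jar> is the assembly JAR described above.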

For Python applications:

  • Instead of <application-jar>, simply pass in your .py file.
  • Add Python .zip, .egg, and .py files to the search path with --py-files.
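For instance, a Python application with additional module dependencies might be submitted roughly as follows; the file names and master URL here are placeholders:

./bin/spark-submit \
  --master local[4] \
  --py-files deps.zip \
  my_app.py arg1 arg2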

Deployment strategies

  • Client mode: This is commonly used when your application is located near your cluster. In this mode, the driver is launched as part of the spark-submit process, which acts as a client to the cluster. The input and output of the application are passed to the console. This mode is suitable when your gateway machine is physically co-located with your worker machines, and it is used by applications such as the Spark shell that involve a REPL. This is the default mode.
  • Cluster mode: This is useful when your application is submitted from a machine far from the worker machines (for example, your laptop); running the driver inside the cluster minimizes network latency between the driver and the executors. Currently, only YARN supports cluster mode for Python applications. The following table shows the combinations of cluster manager, deploy mode, and application type that are not supported in Spark 2.0.0:

Cluster Manager    Deployment Mode    Application Type    Support
Mesos              Cluster            R                   Not supported
Standalone         Cluster            Python              Not supported
Standalone         Cluster            R                   Not supported
Local              Cluster            -                   Incompatible
-                  Cluster            Spark shell         Not applicable
-                  Cluster            SQL shell           Not applicable
-                  Cluster            Thrift server       Not applicable
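As a concrete illustration of the two deployment strategies, the same application could be submitted to YARN in either mode; the class and JAR names below are placeholders:

# Client mode (the default): the driver runs inside the spark-submit process
./bin/spark-submit --class com.example.MyApp --master yarn --deploy-mode client my-app-assembly.jar

# Cluster mode: the driver is launched on a node inside the cluster
./bin/spark-submit --class com.example.MyApp --master yarn --deploy-mode cluster my-app-assembly.jar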
