Apache Spark cluster manager types


As discussed previously, Apache Spark currently supports three cluster managers:

  • Standalone cluster manager
  • Apache Mesos
  • Hadoop YARN

We'll look at setting these up in much more detail in Chapter 8, Operating in Clustered Mode.

Building standalone applications with Apache Spark

Until now we have used Spark for exploratory analysis, using the Scala and Python shells. Spark can also be used in standalone applications written in Java, Scala, Python, or R. As we saw earlier, the Spark shell and PySpark provide you with a SparkContext. However, when you write your own application, you need to initialize your own SparkContext. Once you have a SparkContext reference, the rest of the API is exactly the same as for interactive query analysis. After all, it's the same object, just running in a different context.
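As an example, a minimal standalone Scala application might create its SparkContext along these lines; the application name, master URL, and input path are placeholders used here purely for illustration:

import org.apache.spark.{SparkConf, SparkContext}

object WordCountApp {
  def main(args: Array[String]): Unit = {
    // The shells create a SparkContext for you; a standalone application
    // has to configure and create its own.
    val conf = new SparkConf()
      .setAppName("WordCountApp")   // hypothetical application name
      .setMaster("local[*]")        // assumption: run locally on all cores
    val sc = new SparkContext(conf)

    // From here on, the API is identical to what the interactive shells offer.
    val counts = sc.textFile("input.txt")   // hypothetical input file
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()
  }
}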

The exact method of using Spark in your application depends on your language of choice. All Spark artifacts are hosted in Maven Central. You can add a Maven dependency with the following coordinates:

groupId: org.apache.spark
artifactId: spark-core_2.10
version: 1.6.1

You can use Maven to build the project, or alternatively use the Scala/Eclipse IDE to add a Maven dependency to your project.
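If you prefer SBT (mentioned below) over Maven, a roughly equivalent declaration in build.sbt would be the following; the %% operator appends your project's Scala binary version, here 2.10, to the artifact name:

// build.sbt -- equivalent of the Maven coordinates above
scalaVersion := "2.10.6"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"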

Note

Apache Maven is a build automation tool used primarily for Java projects. The word maven means "accumulator of knowledge" in Yiddish. Maven addresses the two core aspects of building software: first, it describes how the software is built and second, it describes its dependencies.

You can configure your IDE to work with Spark. While many Spark developers use SBT or Maven on the command line, the most commonly used IDE is IntelliJ IDEA. The Community Edition is free, and you can then install the JetBrains Scala plugin. You can find detailed instructions on setting up either IntelliJ IDEA or Eclipse to build Spark at http://bit.ly/28RDPFy.

Submitting applications

The spark-submit script in Spark's bin directory is the most commonly used method of submitting Spark applications to a cluster, and it can be used to launch applications on all supported cluster types. You need to package your application together with its dependencies so that Spark can distribute them across the cluster; in practice, this means creating an assembly JAR (also known as an uber or fat JAR) containing your code and all of its relevant dependencies.

A Spark application with its dependencies can be launched using the bin/spark-submit script. This script takes care of setting up the classpath and its dependencies, and it supports all the cluster managers and deploy modes supported by Spark.

Figure 1.16: Spark submission template
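In its general form, a spark-submit invocation looks roughly like the following; the angle-bracketed values are placeholders you fill in for your application and cluster:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  <application-jar> \
  [application-arguments]

Here, --class names your application's entry point (for JVM languages), --master takes a URL such as spark://host:7077 (standalone), mesos://host:5050, yarn, or local[*], --deploy-mode is either client or cluster, and <application-jar> is the assembly JAR described above.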

For Python applications:

  • Instead of <application-jar>, simply pass in your .py file.
  • Add Python .zip, .egg, and .py files to the search path with --py-files.
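For instance, a Python application with additional module dependencies might be submitted roughly as follows; the file names and master URL here are placeholders:

./bin/spark-submit \
  --master local[4] \
  --py-files deps.zip \
  my_app.py arg1 arg2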

Deployment strategies

  • Client mode: This is commonly used when your application is located near your cluster. In this mode, the driver is launched as part of the spark-submit process, which acts as a client to the cluster. The input and output of the application are passed to the console. This mode is suitable when your gateway machine is physically co-located with your worker machines, and it is used by applications such as the Spark shell that involve a REPL. This is the default mode.
  • Cluster mode: This is useful when your application is submitted from a machine far from the worker machines (for example, your laptop); running the driver inside the cluster minimizes network latency between the driver and the executors. Currently, only YARN supports cluster mode for Python applications. The following table shows the combinations of cluster manager, deploy mode, and application type that are not supported in Spark 2.0.0:

Cluster Manager    Deployment Mode    Application Type    Support
Mesos              Cluster            R                   Not supported
Standalone         Cluster            Python              Not supported
Standalone         Cluster            R                   Not supported
Local              Cluster            -                   Incompatible
-                  Cluster            Spark shell         Not applicable
-                  Cluster            SQL shell           Not applicable
-                  Cluster            Thrift server       Not applicable
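As a concrete illustration of the two deployment strategies, the same application could be submitted to YARN in either mode; the class and JAR names below are placeholders:

# Client mode (the default): the driver runs inside the spark-submit process
./bin/spark-submit --class com.example.MyApp --master yarn --deploy-mode client my-app-assembly.jar

# Cluster mode: the driver is launched on a node inside the cluster
./bin/spark-submit --class com.example.MyApp --master yarn --deploy-mode cluster my-app-assembly.jar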
