Scala and Spark for Big Data Analytics


Product type Book
Published in Jul 2017
Publisher Packt
ISBN-13 9781785280849
Pages 796
Edition 1st Edition
Authors (2): Md. Rezaul Karim, Sridhar Alla

Table of Contents (19 chapters)

  • Preface
  • Introduction to Scala
  • Object-Oriented Scala
  • Functional Programming Concepts
  • Collection APIs
  • Tackle Big Data – Spark Comes to the Party
  • Start Working with Spark – REPL and RDDs
  • Special RDD Operations
  • Introduce a Little Structure - Spark SQL
  • Stream Me Up, Scotty - Spark Streaming
  • Everything is Connected - GraphX
  • Learning Machine Learning - Spark MLlib and Spark ML
  • My Name is Bayes, Naive Bayes
  • Time to Put Some Order - Cluster Your Data with Spark MLlib
  • Text Analytics Using Spark ML
  • Spark Tuning
  • Time to Go to ClusterLand - Deploying Spark on a Cluster
  • Testing and Debugging Spark
  • PySpark and SparkR

Time to Go to ClusterLand - Deploying Spark on a Cluster

"I see the moon like a clipped piece of silver. Like gilded bees, the stars cluster around her"

- Oscar Wilde

In the previous chapters, we saw how to develop practical applications using the different Spark APIs. In this chapter, we will see how Spark works in cluster mode and examine its underlying architecture. Finally, we will see how to deploy a full Spark application on a cluster. In a nutshell, the following topics will be covered throughout this chapter:

  • Spark architecture in a cluster
  • Spark ecosystem and cluster management
  • Deploying Spark on a cluster
  • Deploying Spark on a standalone cluster
  • Deploying Spark on a Mesos cluster
  • Deploying Spark on a YARN cluster
  • Cloud-based deployment
  • Deploying Spark on AWS

Spark architecture in a cluster

The Hadoop-based MapReduce framework has been widely used for the last few years; however, it suffers from issues with I/O, algorithmic complexity, low-latency streaming jobs, and its fully disk-based operation. Hadoop provides the Hadoop Distributed File System (HDFS) for storing big data cheaply and computing over it efficiently, but with the Hadoop-based MapReduce framework you can only perform computations in a high-latency batch model over static data. The main big data paradigm shift that Spark has brought us is the introduction of in-memory computing and a caching abstraction. This makes Spark ideal for large-scale data processing and enables the computing nodes to perform multiple operations by accessing the same input data.

Spark's Resilient Distributed Dataset (RDD) model can do everything that the MapReduce paradigm can, and even more. Nevertheless, Spark...
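As a minimal sketch of the in-memory caching abstraction described above (assuming a local Spark environment with `spark-core` on the classpath; the application name and object name are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachingSketch {
  def main(args: Array[String]): Unit = {
    // Run locally with all available cores; "CachingSketch" is an arbitrary name
    val conf = new SparkConf().setAppName("CachingSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val numbers = sc.parallelize(1 to 1000000)

    // cache() keeps the RDD in memory after the first action computes it,
    // so later actions reuse the same data without recomputing from scratch
    val evens = numbers.filter(_ % 2 == 0).cache()

    val count = evens.count() // first action: computes and caches the RDD
    val total = evens.sum()   // second action: served from the in-memory cache

    println(s"count = $count, sum = $total")
    sc.stop()
  }
}
```

Because transformations such as `filter` are lazy, nothing is computed until the first action runs; without `cache()`, each action would re-read and re-filter the input.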

Deploying the Spark application on a cluster

In this section, we will discuss how to deploy Spark jobs on a computing cluster. We will see how to deploy applications in three deploy modes: standalone, YARN, and Mesos. The following figure summarizes the terms needed to refer to cluster concepts in this chapter:

Figure 8: Terms that are needed to refer to cluster concepts (source: http://spark.apache.org/docs/latest/cluster-overview.html#glossary)

However, before diving deeper, we need to know how to submit a Spark job in general.

Submitting Spark jobs

Once a Spark application is bundled as either a JAR file (written in Scala or Java) or a Python file, it can be submitted using the spark-submit script located under the...
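As a sketch of a typical invocation (the JAR name, main class, and master URL are illustrative; the spark-submit script ships in the bin directory of a Spark distribution):

```shell
# Submit a Scala/Java application packaged as a JAR to a standalone cluster.
# --class names the application's entry point; --master selects the cluster manager.
$SPARK_HOME/bin/spark-submit \
  --class com.example.MyApp \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --executor-memory 2G \
  --total-executor-cores 4 \
  myapp-assembly-1.0.jar arg1 arg2
```

Changing only the `--master` URL (for example, `yarn` or `mesos://host:5050`) retargets the same application at a different cluster manager, which is what makes the deploy modes in this section largely interchangeable from the application's point of view.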

Summary

In this chapter, we discussed how Spark works in cluster mode and examined its underlying architecture. You also saw how to deploy a full Spark application on a cluster, and how to set up clusters for running Spark applications in different modes such as local, standalone, YARN, and Mesos. Finally, you saw how to configure a Spark cluster on AWS using the EC2 script. We believe that this chapter will help you gain a good understanding of Spark. Nevertheless, due to space limitations, we could not cover many APIs and their underlying functionality.

If you face any issues, please don't forget to report them to the Spark user mailing list at user@spark.apache.org. Before doing so, make sure that you have subscribed to it. In the next chapter, you will see how to test and debug Spark applications.
