Chapter 3. The Engine - Apache Spark

In this chapter, we'll walk through the process of downloading and running Apache Spark. We'll first see how to run it in local mode on a single computer, and then we'll run it in cluster mode. We'll also look at Spark's core abstraction for data manipulation, the resilient distributed dataset (RDD). Finally, we'll dive into an RDD-based abstraction called DStreams (or discretized streams); Spark Streaming is the core part of this chapter.

This chapter is written for Spark newcomers, but it doesn't focus on Spark's data science capabilities; it is aimed at data engineering and data architecture.

In this chapter, we will learn:

  • Spark in local mode
  • Spark core concepts
  • Resilient distributed datasets
  • Spark in cluster mode
  • Spark Streaming

Spark in local mode


A cluster-based Apache Spark installation can become a complex task: when we integrate Mesos, Kafka, and Cassandra, the installation becomes an interdisciplinary effort among engineers from databases, telecommunications, operating systems, and infrastructure.

However, downloading and installing Apache Spark on a laptop in local mode for learning and exploration is so easy that many developers and data scientists have become engaged by, and married to, the platform.

This low barrier to entry lets many small businesses launch pilot projects without interfering with production systems, without building complex tooling, and without hiring expensive specialists. As previously mentioned, Spark brings big data within everyone's reach.

Apache Spark is open source software and can be downloaded freely from the Apache Software Foundation site. Spark requires at least Java 6 and at least Maven 3.0.4. All dependencies on...
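As a minimal sketch, assuming a prebuilt binary release (the version number and mirror path below are illustrative; pick a current release from the downloads page), getting a local installation running takes only a few commands:

$ wget https://archive.apache.org/dist/spark/spark-1.6.2/spark-1.6.2-bin-hadoop2.6.tgz
$ tar -xzf spark-1.6.2-bin-hadoop2.6.tgz
$ cd spark-1.6.2-bin-hadoop2.6
$ ./bin/spark-shell      # launches the interactive Scala shell in local mode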

Spark core concepts


Now that we have Spark running in our shell, we can look at programming in greater detail. A Spark application consists of a driver program, which distributes both the fragments of a data structure and the operations on them across the cluster members, so the work runs in a distributed way.

The driver program accesses Spark through a SparkContext object, which represents the connection to the cluster. In the shell, it's always available through the sc variable. To see what type sc is:

scala> sc 
res1: org.apache.spark.SparkContext = org.apache.spark.SparkContext@e4b54d3 

To run operations, the driver program relies on a set of processes called executors. For example, if we run a simple count() operation on a cluster, the counting work is distributed among all the cluster members, each working on the portion of the file assigned to it by the driver program.

In our examples, as we only have one machine where we run the Spark shell...
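For instance, here is a minimal spark-shell session (README.md is just an example file shipped in the Spark directory; any text file works) that builds an RDD and runs the count() action:

scala> val lines = sc.textFile("README.md")   // lazily defines an RDD of lines
scala> lines.count()                          // action: each executor counts its assigned partitions
res2: Long = 95

The exact count will of course depend on the file; in local mode, the "executors" are simply threads inside the same JVM.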

Resilient distributed datasets


The soul of Spark is the resilient distributed dataset (RDD). It reflects four design goals: store data in memory (unlike Hadoop MapReduce, which works from disk), distribute it across a cluster, tolerate faults, and be fast and efficient.

Fault tolerance is achieved, in part, by recording the lineage of the operations applied to small data chunks, so that lost pieces can be recomputed. Efficiency is achieved by parallelizing operations across all parts of the cluster. Performance is achieved by minimizing data replication between cluster members.

A fundamental concept in Spark is that there are only two types of operations we can perform on an RDD (a short spark-shell sketch follows this list):

  • Transformations: Create a new RDD from the original, which itself is never modified; for example, map, filter, union, intersection, sortBy, join, coalesce
  • Actions: Compute a result from an RDD and return it to the driver program; for example, count, collect, first
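A minimal spark-shell sketch of both kinds of operation, using arbitrary sample data:

scala> val nums = sc.parallelize(1 to 10)    // source RDD
scala> val evens = nums.filter(_ % 2 == 0)   // transformation: a new RDD, nothing computed yet
scala> evens.count()                         // action: triggers the actual computation
res3: Long = 5

Note that the transformation is lazy: no work happens until an action forces evaluation.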

People are right when they say that computer science is mathematics in a costume. As we've already seen, in functional programming, functions are first-class citizens; the equivalent in mathematics is...

Spark in cluster mode


So far in this chapter, we have focused on running Spark in local mode. As we mentioned, horizontal scaling is what makes Spark so sensual and powerful. You don't need software-hardware integration gurus to run Apache Spark clusters, and you don't need to stop your organization's entire production to scale out and add more machines to your cluster.

The good news is that the same scripts you build on your laptop against samples of a few kilobytes can run on business clusters that handle terabytes of information. There's no need to change the code, and no need to invoke another API. All you have to do is test repeatedly to be sure your model runs correctly, and then deploy to the cluster.

In this section, we'll describe the runtime architecture of a distributed Spark application, and then we'll look at the options for running a Spark application on a cluster.

Apache Spark has its own built-in standalone cluster manager, but you can also run it on several other cluster managers...
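As a hedged sketch of what deployment looks like with the standalone manager (the master host, application jar, and main class below are placeholders), only the --master URL changes between local and cluster runs:

# local mode, as used so far in this chapter
$ ./bin/spark-submit --master "local[4]" --class com.example.WordCount app.jar

# the same jar, submitted to a standalone cluster manager
$ ./bin/spark-submit --master spark://master-host:7077 --class com.example.WordCount app.jar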

Spark Streaming


When studying calculus, one thing that becomes clear is that life is not a discrete process, it is continuous; and life does not come in small packages, it is a continuously flowing stream.

As discussed in the first chapter, the fresher the information, the greater the benefit of the data. Many modern machine-learning applications must compute their results in real time.

Spark Streaming is the Spark module for managing data flows. Much of Spark is built around the concept of the RDD; Spark Streaming builds on it with the concept of DStreams, or discretized streams. A DStream is a sequence of information related to time. It is very important to emphasize that, internally, a DStream is a sequence of RDDs, hence the name discretized.

Just as RDDs support two kinds of operations, DStreams also offer two types of operations (illustrated in the sketch after this list):

  • Transformations, whose result is another DStream
  • Output operations, which write information to external systems
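A minimal Spark Streaming word count sketch, assuming text arriving on a local socket (the host and port are placeholders): flatMap, map, and reduceByKey are transformations, while print() is an output operation:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(1))      // 1-second batch interval

val lines = ssc.socketTextStream("localhost", 9999)   // DStream of text lines
val counts = lines.flatMap(_.split(" "))              // transformation
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.print()                                        // output operation

ssc.start()              // start receiving and processing
ssc.awaitTermination()   // run until stopped

Under the hood, each one-second batch becomes an RDD, which is exactly the "discretized" sequence described above.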

DStreams offer many of the operations available on RDDs, plus new time-related...

Summary


In this chapter, we learned the key points of Apache Spark from scratch. We saw how to download, install, and test Apache Spark, and how to run Spark applications. We also reviewed core Spark concepts, such as the RDD and its operations (transformations and actions).

In addition, we saw how to run Apache Spark in cluster mode, how to run the driver program, and how to achieve high availability.

Finally, we dived into Spark Streaming: stateless and stateful transformations, output operations, how to run it 24/7, and how to improve Spark Streaming performance.

In the following chapters, we will see how Apache Spark is the glue of our stack; each chapter will show how its technology relates to Spark.
