Chapter 3. The Engine - Apache Spark

In this chapter, we'll walk through the process of downloading and running Apache Spark. We'll first see how to run it in local mode on a single computer, and then we'll run it in cluster mode. We'll also look at Spark's core abstraction for data manipulation, the resilient distributed dataset (RDD). Finally, we'll dive into an RDD-based abstraction called DStreams (or discretized streams); Spark Streaming is the core part of this chapter.

This chapter is written for Spark newcomers, but it doesn't focus on Spark's data science capabilities; it is aimed at data engineering and data architecture.

In this chapter, we will learn:

  • Spark in local mode
  • Spark core concepts
  • Resilient distributed datasets
  • Spark in cluster mode
  • Spark Streaming

Spark in local mode


A cluster-based Apache Spark installation can become a complex task: when we integrate Mesos, Kafka, and Cassandra, the installation becomes an interdisciplinary effort among engineers from databases, telecommunications, operating systems, and infrastructure.

However, downloading and installing Apache Spark on a laptop in local mode for learning and exploration is so easy that many developers and data scientists have become engaged by, and married to, the platform.

This low barrier to entry lets many small businesses launch pilot projects without interfering with production systems, without building complex tooling, and without hiring expensive specialists. As previously mentioned, Spark brings big data within everyone's reach.

Apache Spark is open source software and can be downloaded freely from the Apache Software Foundation site. Spark requires at least Java 6 and at least Maven 3.0.4. All dependencies on...
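As a minimal sketch, assuming a prebuilt binary release (the version number and mirror path below are illustrative; pick a current release from the downloads page), getting a local installation running takes only a few commands:

$ wget https://archive.apache.org/dist/spark/spark-1.6.2/spark-1.6.2-bin-hadoop2.6.tgz
$ tar -xzf spark-1.6.2-bin-hadoop2.6.tgz
$ cd spark-1.6.2-bin-hadoop2.6
$ ./bin/spark-shell      # launches the interactive Scala shell in local mode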

Spark core concepts


Now that we have Spark running in our shell, we can look at programming in greater detail. A Spark application consists of a driver program, which distributes both the fragments of a data structure and the operations on them across the cluster members, so the work runs in a distributed way.

The driver program accesses Spark through a SparkContext object, which represents the connection to the cluster. In the shell, it's always available through the sc variable. To see what type sc is:

scala> sc 
res1: org.apache.spark.SparkContext = org.apache.spark.SparkContext@e4b54d3 

To run operations, the driver program relies on a set of processes called executors. For example, if we run a simple count() operation on a cluster, the counting work is distributed among all the cluster members, each working on the portion of the file assigned to it by the driver program.

In our examples, as we only have one machine where we run the Spark shell...
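For instance, here is a minimal spark-shell session (README.md is just an example file shipped in the Spark directory; any text file works) that builds an RDD and runs the count() action:

scala> val lines = sc.textFile("README.md")   // lazily defines an RDD of lines
scala> lines.count()                          // action: each executor counts its assigned partitions
res2: Long = 95

The exact count will of course depend on the file; in local mode, the "executors" are simply threads inside the same JVM.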

Resilient distributed datasets


The soul of Spark is the resilient distributed dataset (RDD). It reflects four design goals: store data in memory (unlike Hadoop MapReduce, which works from disk), distribute it across a cluster, tolerate faults, and be fast and efficient.

Fault tolerance is achieved, in part, by recording the lineage of the operations applied to small data chunks, so that lost pieces can be recomputed. Efficiency is achieved by parallelizing operations across all parts of the cluster. Performance is achieved by minimizing data replication between cluster members.

A fundamental concept in Spark is that there are only two types of operations we can perform on an RDD (a short spark-shell sketch follows this list):

  • Transformations: Create a new RDD from the original, which itself is never modified; for example, map, filter, union, intersection, sortBy, join, coalesce
  • Actions: Compute a result from an RDD and return it to the driver program; for example, count, collect, first
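A minimal spark-shell sketch of both kinds of operation, using arbitrary sample data:

scala> val nums = sc.parallelize(1 to 10)    // source RDD
scala> val evens = nums.filter(_ % 2 == 0)   // transformation: a new RDD, nothing computed yet
scala> evens.count()                         // action: triggers the actual computation
res3: Long = 5

Note that the transformation is lazy: no work happens until an action forces evaluation.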

People are right when they say that computer science is mathematics in a costume. As we've already seen, in functional programming, functions are first-class citizens; the equivalent in mathematics is...

Spark in cluster mode


So far in this chapter, we have focused on running Spark in local mode. As we mentioned, horizontal scaling is what makes Spark so sensual and powerful. You don't need software-hardware integration gurus to run Apache Spark clusters, and you don't need to stop your organization's entire production to scale out and add more machines to your cluster.

The good news is that the same scripts you build on your laptop against samples of a few kilobytes can run on business clusters that handle terabytes of information. There's no need to change the code, and no need to invoke another API. All you have to do is test repeatedly to be sure your model runs correctly, and then deploy to the cluster.

In this section, we'll describe the runtime architecture of a distributed Spark application, and then we'll look at the options for running a Spark application on a cluster.

Apache Spark has its own built-in standalone cluster manager, but you can also run it on several other cluster managers...
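As a hedged sketch of what deployment looks like with the standalone manager (the master host, application jar, and main class below are placeholders), only the --master URL changes between local and cluster runs:

# local mode, as used so far in this chapter
$ ./bin/spark-submit --master "local[4]" --class com.example.WordCount app.jar

# the same jar, submitted to a standalone cluster manager
$ ./bin/spark-submit --master spark://master-host:7077 --class com.example.WordCount app.jar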

Spark Streaming


When studying calculus, one thing that becomes clear is that life is not a discrete process, it is continuous; and life does not come in small packages, it is a continuously flowing stream.

As discussed in the first chapter, the fresher the information, the greater the benefit of the data. Many modern machine-learning applications must compute their results in real time.

Spark Streaming is the Spark module for managing data flows. Much of Spark is built around the concept of the RDD; Spark Streaming builds on it with the concept of DStreams, or discretized streams. A DStream is a sequence of information related to time. It is very important to emphasize that, internally, a DStream is a sequence of RDDs, hence the name discretized.

Just as RDDs support two kinds of operations, DStreams also offer two types of operations (illustrated in the sketch after this list):

  • Transformations, whose result is another DStream
  • Output operations, which write information to external systems
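A minimal Spark Streaming word count sketch, assuming text arriving on a local socket (the host and port are placeholders): flatMap, map, and reduceByKey are transformations, while print() is an output operation:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(1))      // 1-second batch interval

val lines = ssc.socketTextStream("localhost", 9999)   // DStream of text lines
val counts = lines.flatMap(_.split(" "))              // transformation
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.print()                                        // output operation

ssc.start()              // start receiving and processing
ssc.awaitTermination()   // run until stopped

Under the hood, each one-second batch becomes an RDD, which is exactly the "discretized" sequence described above.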

DStreams offer many of the operations available on RDDs, plus new time-related...

Summary


In this chapter, we learned the key points of Apache Spark from scratch. We saw how to download, install, and test Apache Spark, and how to run Spark applications. We also reviewed core Spark concepts, such as the RDD and its operations (transformations and actions).

In addition, we saw how to run Apache Spark in cluster mode, how to run the driver program, and how to achieve high availability.

Finally, we dived into Spark Streaming: stateless and stateful transformations, output operations, how to run it 24/7, and how to improve Spark Streaming performance.

In the following chapters, we will see how Apache Spark is the glue of our stack; each chapter will show how its technology relates to Spark.
