Chapter 7. Real-Time Analytics with Apache Spark

In this chapter, we will introduce the stream-processing model of Apache Spark and show you how to build streaming-based, real-time analytical applications. The chapter focuses on Spark Streaming and on processing data streams using the Spark API.

More specifically, you will learn how to process Twitter's tweets, as well as several other ways of processing real-time data streams. The chapter covers the following topics:

  • A short introduction to streaming
  • Spark Streaming
  • Discretized Streams
  • Stateful and stateless transformations
  • Checkpointing
  • Operating with other streaming platforms (such as Apache Kafka)
  • Structured Streaming

Streaming


In the modern world, an increasing number of people are becoming interconnected via the internet, and with the advent of the smartphone this trend has skyrocketed. Nowadays, a smartphone can be used for many things, such as checking social media, ordering food, and calling a cab. We find ourselves more reliant on the internet than ever before, and we will only become more reliant in the future. With this development comes a massive increase in data generation. As the internet began to boom, the very nature of data processing changed: any time one of these apps or services is accessed on a phone, real-time data processing is taking place. Because there is a lot at stake in the quality of their applications, companies are forced to improve data processing, and with improvements come paradigm shifts. One paradigm currently being researched and used is the idea of a highly scalable, real-time (or as close to real-time as possible) processing...

Spark Streaming


Spark Streaming wasn't the first streaming architecture. Over time, multiple technologies have been developed to address various real-time processing needs. One of the first popular stream-processing technologies was Twitter Storm, which was used in many businesses. Spark includes a streaming library that has grown to become the most widely used technology today. This is mainly because Spark Streaming holds some significant advantages over the other technologies, the most important being the integration of the Spark Streaming APIs with the core Spark API. In addition, Spark Streaming is integrated with Spark ML and Spark SQL, along with GraphX. Because of all of these integrations, Spark is a powerful and versatile streaming technology.

Note that https://spark.apache.org/docs/2.1.0/streaming-programming-guide.html has more information on Spark Streaming, Flink, Heron (Twitter Storm's successor), and Samza and their various features; for example, their...
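As a minimal sketch of how the streaming API layers on top of the core API (our own example, not from the book; it assumes an existing SparkContext called sc and uses a simple socket text source purely for illustration):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Wrap the existing SparkContext in a StreamingContext with a 5-second batch interval.
val ssc = new StreamingContext(sc, Seconds(5))

// Illustrative source: a text stream from a socket (for example, netcat on port 9999).
val lines = ssc.socketTextStream("localhost", 9999)

// DStream transformations mirror the core RDD API: flatMap, map, reduceByKey, and so on.
val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.print()

// Start the streaming computation; awaitTermination blocks until it is stopped.
ssc.start()
ssc.awaitTermination()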

fileStream


The fileStream method creates an input stream that monitors a Hadoop-compatible filesystem and reads new files using the given key-value types and input format. Any filenames starting with . are ignored. By invoking an atomic file rename, a file whose name starts with . can be renamed to a usable filename, at which point fileStream picks it up and processes its contents:

def fileStream[K: ClassTag, V: ClassTag, F <: NewInputFormat[K, V]: ClassTag](
    directory: String): InputDStream[(K, V)]
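A hedged usage sketch (our own example; the directory path is hypothetical, and it assumes the StreamingContext ssc created earlier) that reads new text files as (LongWritable, Text) pairs via TextInputFormat:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Monitor a (hypothetical) HDFS directory and read new files as (offset, line) records.
val fileDStream = ssc.fileStream[LongWritable, Text, TextInputFormat]("hdfs:///data/incoming")

// Keep only the text of each record.
val fileLines = fileDStream.map { case (_, text) => text.toString }
fileLines.print()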

textFileStream

The textFileStream method creates an input stream that monitors a Hadoop-compatible filesystem and reads new files as text files, with the key as LongWritable, the value as Text, and the input format as TextInputFormat. Any files whose names start with . are ignored:

def textFileStream(directory: String): DStream[String]
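A minimal usage sketch (our own; the directory is hypothetical and ssc is the StreamingContext created earlier) that counts words across files arriving in a monitored directory:

// Each new file dropped into the directory is read as a stream of text lines.
val textStream = ssc.textFileStream("hdfs:///data/incoming-text")
val fileWordCounts = textStream.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
fileWordCounts.print()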

binaryRecordsStream

Using binaryRecordsStream, an input stream is created that monitors a Hadoop-compatible filesystem and reads new files as flat binary files. Any filenames starting with . are ignored...
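For reference, the Spark API defines the method as follows, where recordLength is the fixed length of each binary record; the short usage sketch below is our own, with an illustrative directory and record length:

def binaryRecordsStream(directory: String, recordLength: Int): DStream[Array[Byte]]

// Illustrative usage: each element of the stream is one fixed-length record of bytes.
val records = ssc.binaryRecordsStream("hdfs:///data/incoming-binary", 128)
records.map(_.length).print()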

Transformations


Transformations on DStreams are similar to those applicable to a Spark Core RDD. Since a DStream consists of RDDs, a transformation is applied to each RDD in the stream, generating a transformed RDD for each one and thereby a transformed DStream. Each transformation creates a specific DStream-derived class.

There are many DStream classes, each built for a specific functionality; map transformations, window functions, reduce actions, and the different input stream types are all implemented using different DStream-derived classes.

The following table showcases the possible types of transformations:

Transformation               Meaning
map(func)                    Applies the transformation function to each element of the DStream and returns a new DStream.
filter(func)                 Filters out the records of the DStream to return a new DStream.
repartition(numPartitions)   Creates more or fewer partitions to redistribute the data to change the parallelism.
union(otherStream)           Combines the elements in two source DStreams and returns a new DStream.
count()                      Returns...
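As a brief sketch of a couple of these transformations (our own example, assuming the textStream from the textFileStream example above plus a second, hypothetical DStream named otherTextStream):

// Combine two DStreams of the same type and spread the result over more partitions.
val combined = textStream.union(otherTextStream)
val repartitioned = combined.repartition(8)
repartitioned.count().print()   // number of records in each batch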

Checkpointing


As real-time streaming applications are expected to run for extended periods of time while remaining resilient to failure, Spark Streaming implements a mechanism called checkpointing. This mechanism tracks enough information to be able to recover from any failures. There are two types of checkpointing:

  • Metadata checkpointing 
  • Data checkpointing

Checkpointing is enabled by calling checkpoint() on the StreamingContext:

def checkpoint(directory: String)

This specifies the directory where the checkpoint data is to be stored. Note that this must be a filesystem that is fault tolerant, such as HDFS.

Once the directory for the checkpoint is set, any DStream can be checkpointed into it, based on an interval. Revisiting the Twitter example, each DStream can be checkpointed every 10 seconds:

// Create a StreamingContext with a 5-second batch interval from the existing SparkContext.
val ssc = new StreamingContext(sc, Seconds(5))
// Receive live tweets and split each tweet's text into words.
val twitterStream = TwitterUtils.createStream(ssc, None)
val wordStream = twitterStream.flatMap(x => x.getText().split(" "))
val aggStream...
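A minimal sketch of how the checkpointing itself could be wired in (our own example; the checkpoint directory path is hypothetical):

// Directory on a fault-tolerant filesystem such as HDFS (path is illustrative).
ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")
// Checkpoint the word DStream every 10 seconds, as described above.
wordStream.checkpoint(Seconds(10))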

Driver failure recovery


We can achieve driver failure recovery with the help of StreamingContext.getOrCreate(). As previously mentioned, this will either initialize a StreamingContext from a checkpoint that already exists, or create a new one. 

We will now implement a function called createStreamContext(), which creates a StreamingContext and sets up DStreams to interpret tweets and generate the top five most-used hashtags using a 15-second window. Instead of invoking createStreamContext() and then calling ssc.start(), we will invoke getOrCreate(), so that if a checkpoint exists, the StreamingContext will be recreated from the data in the checkpoint directory. If there is no such directory, or if the application is on its first run, then createStreamContext() will be invoked:

val ssc = StreamingContext.getOrCreate(checkpointDirectory,
  createStreamContext _)

The following code shows how the function is defined, and how getOrCreate() can be invoked:

val checkpointDirectory = "checkpoints...
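As a hedged sketch of what such a function could look like (our own reconstruction, not the book's exact listing; it assumes an existing SparkContext sc, the checkpointDirectory defined above, and the spark-streaming-twitter package):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

def createStreamContext(): StreamingContext = {
  val ssc = new StreamingContext(sc, Seconds(5))
  ssc.checkpoint(checkpointDirectory)
  // Split tweets into words and keep only hashtags.
  val hashTags = TwitterUtils.createStream(ssc, None)
    .flatMap(_.getText.split(" "))
    .filter(_.startsWith("#"))
  // Count hashtags over a 15-second window and print the top five per batch.
  hashTags.map(tag => (tag, 1))
    .reduceByKeyAndWindow(_ + _, Seconds(15))
    .foreachRDD { rdd =>
      rdd.sortBy(_._2, ascending = false).take(5).foreach(println)
    }
  ssc
}

// Recreate the context from the checkpoint if one exists; otherwise build it afresh.
val ssc = StreamingContext.getOrCreate(checkpointDirectory, createStreamContext _)
ssc.start()
ssc.awaitTermination()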

Interoperability with streaming platforms (Apache Kafka)


Spark Streaming integrates well with Apache Kafka, currently the most popular messaging platform. There are several approaches to this integration, and the mechanism has improved over time with regard to performance and reliability.

There are three main approaches:

  • Receiver-based approach
  • Direct Stream approach
  • Structured Streaming

Receiver-based

The first integration between Spark and Kafka was the receiver-based integration. In the receiver-based approach, the driver starts receivers on the executors, which pull data from the Kafka brokers using Kafka's high-level consumer API. As events are pulled from the brokers, the receivers update the offsets in ZooKeeper, which is also used by the Kafka cluster. The important aspect here is the use of the write-ahead log (WAL), which the receiver writes to as it collects data from Kafka. If there is a problem and the executors and receivers are lost or have to restart, the WAL can...
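A minimal sketch of the receiver-based API (our own example, assuming a StreamingContext ssc and the spark-streaming-kafka-0-8 package; the ZooKeeper quorum, consumer group, and topic names are illustrative):

import org.apache.spark.streaming.kafka.KafkaUtils

// Topic name -> number of receiver threads for that topic.
val topics = Map("tweets" -> 1)
val kafkaStream = KafkaUtils.createStream(
  ssc,                      // existing StreamingContext
  "zkhost1:2181",           // ZooKeeper quorum (illustrative)
  "spark-streaming-group",  // consumer group id (illustrative)
  topics)

// Each record is a (key, message) pair; keep just the message text.
val messages = kafkaStream.map(_._2)
messages.print()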

Handling event time and late data


Event time is the time embedded in the data itself. Spark Streaming defines time as the receive time for DStream purposes, but for many applications that need the event time, this is not enough. For example, if you want the number of times a hashtag appears in tweets every minute, then you need the time when the data was generated, not the time when Spark received the event.

The following is an extension of the previous Structured Streaming example, listening on server port 9999. The timestamp is now included as part of the input data, so we can perform window operations on the unbounded table:

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Create a DataFrame representing the stream of input lines from the connection to host:port.
val inputLines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .option("includeTimestamp", true)
  .load()...
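A hedged sketch of the windowed aggregation that could follow (our own continuation, not the book's exact code): split each line into words, keep the timestamp added by includeTimestamp, and count words per 1-minute window.

import spark.implicits._

// Split lines into (word, timestamp) pairs, retaining the timestamp for each word.
val words = inputLines.as[(String, Timestamp)]
  .flatMap { case (line, ts) => line.split(" ").map(word => (word, ts)) }
  .toDF("word", "timestamp")

// Count words per 1-minute window on the timestamp column, sliding every 30 seconds.
val windowedCounts = words
  .groupBy(window($"timestamp", "1 minute", "30 seconds"), $"word")
  .count()

// Write the running windowed counts to the console.
val query = windowedCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
query.awaitTermination()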

Fault-tolerance semantics


Achieving exactly-once semantics is complicated in traditional streaming systems, which use an external database or storage to maintain offsets. Structured Streaming is still evolving, and has several challenges to overcome before it sees widespread use.

Summary


Over the course of this chapter, we covered the concepts of stream-processing systems, Spark Streaming, DStreams in Apache Spark, DAGs and DStream lineages, and transformations and actions. We also covered window-based stream processing and a practical example of processing Twitter tweets using Spark Streaming. We then looked at the receiver-based and direct-stream approaches to consuming data from Kafka, and finally at the newly developing technology of Structured Streaming, which aims to solve many current challenges, such as fault tolerance and exactly-once semantics in the stream, and to simplify integration with messaging systems such as Kafka, while maintaining the flexibility and extensibility to integrate with other input stream types.

In the next chapter, we will explore Apache Flink, which is a key challenger to Spark as a computing platform.

 
