Chapter 7. Real-Time Analytics with Apache Spark

In this chapter, we will introduce the stream-processing model of Apache Spark and show you how to build streaming-based, real-time analytical applications. The chapter focuses on Spark Streaming and on processing data streams using the Spark API.

More specifically, you will learn how to process Twitter's tweets, as well as several other ways of processing real-time data streams. The chapter covers the following topics:

  • A short introduction to streaming
  • Spark Streaming
  • Discretized Streams
  • Stateful and stateless transformations
  • Checkpointing
  • Operating with other streaming platforms (such as Apache Kafka)
  • Structured Streaming

Streaming


In the modern world, an increasing number of people are becoming interconnected via the internet, and with the advent of the smartphone this trend has skyrocketed. Nowadays, a smartphone can be used for many things, such as checking social media, ordering food, and calling a cab. We find ourselves more reliant on the internet than ever before, and we will only become more reliant in the future. With this development comes a massive increase in data generation. As the internet began to boom, the very nature of data processing changed: any time one of these apps or services is accessed on a phone, real-time data processing is taking place. Because there is a lot at stake in the quality of their applications, companies are forced to improve data processing, and with improvements come paradigm shifts. One paradigm currently being researched and used is the idea of a highly scalable, real-time (or as close to real-time as possible) processing...

Spark Streaming


Spark Streaming wasn't the first streaming architecture. Over time, multiple technologies have been developed to address various real-time processing needs. One of the first popular stream-processing technologies was Twitter Storm, which was used in many businesses. Spark includes a streaming library that has grown to become the most widely used technology today. This is mainly because Spark Streaming holds some significant advantages over the other technologies, the most important being the integration of the Spark Streaming APIs with the core Spark API. In addition, Spark Streaming is integrated with Spark ML and Spark SQL, along with GraphX. Because of all of these integrations, Spark is a powerful and versatile streaming technology.

Note that https://spark.apache.org/docs/2.1.0/streaming-programming-guide.html has more information on Spark Streaming, Flink, Heron (Twitter Storm's successor), and Samza and their various features; for example, their...
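As a minimal sketch of how the streaming API layers on top of the core API (our own example, not from the book; it assumes an existing SparkContext called sc and uses a simple socket text source purely for illustration):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Wrap the existing SparkContext in a StreamingContext with a 5-second batch interval.
val ssc = new StreamingContext(sc, Seconds(5))

// Illustrative source: a text stream from a socket (for example, netcat on port 9999).
val lines = ssc.socketTextStream("localhost", 9999)

// DStream transformations mirror the core RDD API: flatMap, map, reduceByKey, and so on.
val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.print()

// Start the streaming computation; awaitTermination blocks until it is stopped.
ssc.start()
ssc.awaitTermination()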

fileStream


The fileStream method creates an input stream that monitors a Hadoop-compatible filesystem and reads new files using the given key-value types and input format. Any filenames starting with . are ignored. By invoking an atomic file rename, a file whose name starts with . can be renamed to a usable filename, at which point fileStream picks it up and processes its contents:

def fileStream[K: ClassTag, V: ClassTag, F <: NewInputFormat[K, V]: ClassTag](
    directory: String): InputDStream[(K, V)]
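A hedged usage sketch (our own example; the directory path is hypothetical, and it assumes the StreamingContext ssc created earlier) that reads new text files as (LongWritable, Text) pairs via TextInputFormat:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Monitor a (hypothetical) HDFS directory and read new files as (offset, line) records.
val fileDStream = ssc.fileStream[LongWritable, Text, TextInputFormat]("hdfs:///data/incoming")

// Keep only the text of each record.
val fileLines = fileDStream.map { case (_, text) => text.toString }
fileLines.print()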

textFileStream

The textFileStream method creates an input stream that monitors a Hadoop-compatible filesystem and reads new files as text files, with the key as LongWritable, the value as Text, and the input format as TextInputFormat. Any files whose names start with . are ignored:

def textFileStream(directory: String): DStream[String]
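A minimal usage sketch (our own; the directory is hypothetical and ssc is the StreamingContext created earlier) that counts words across files arriving in a monitored directory:

// Each new file dropped into the directory is read as a stream of text lines.
val textStream = ssc.textFileStream("hdfs:///data/incoming-text")
val fileWordCounts = textStream.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
fileWordCounts.print()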

binaryRecordsStream

Using binaryRecordsStream, an input stream is created that monitors a Hadoop-compatible filesystem and reads new files as flat binary files. Any filenames starting with . are ignored...
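For reference, the Spark API defines the method as follows, where recordLength is the fixed length of each binary record; the short usage sketch below is our own, with an illustrative directory and record length:

def binaryRecordsStream(directory: String, recordLength: Int): DStream[Array[Byte]]

// Illustrative usage: each element of the stream is one fixed-length record of bytes.
val records = ssc.binaryRecordsStream("hdfs:///data/incoming-binary", 128)
records.map(_.length).print()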

Transformations


Transformations on DStreams are similar to those applicable to a Spark Core RDD. Since a DStream consists of RDDs, a transformation is applied to each RDD in the stream, generating a transformed RDD for each one and thereby a transformed DStream. Each transformation creates a specific DStream-derived class.

There are many DStream classes, each built for a specific functionality; map transformations, window functions, reduce actions, and the different input stream types are all implemented using different DStream-derived classes.

The following table showcases the possible types of transformations:

Transformation               Meaning
map(func)                    Applies the transformation function to each element of the DStream and returns a new DStream.
filter(func)                 Filters out the records of the DStream to return a new DStream.
repartition(numPartitions)   Creates more or fewer partitions to redistribute the data to change the parallelism.
union(otherStream)           Combines the elements in two source DStreams and returns a new DStream.
count()                      Returns...
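As a brief sketch of a couple of these transformations (our own example, assuming the textStream from the textFileStream example above plus a second, hypothetical DStream named otherTextStream):

// Combine two DStreams of the same type and spread the result over more partitions.
val combined = textStream.union(otherTextStream)
val repartitioned = combined.repartition(8)
repartitioned.count().print()   // number of records in each batch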

Checkpointing


As real-time streaming applications are expected to run for extended periods of time while remaining resilient to failure, Spark Streaming implements a mechanism called checkpointing. This mechanism tracks enough information to be able to recover from any failures. There are two types of checkpointing:

  • Metadata checkpointing 
  • Data checkpointing

Checkpointing is enabled by calling checkpoint() on the StreamingContext:

def checkpoint(directory: String)

This specifies the directory where the checkpoint data is to be stored. Note that this must be a filesystem that is fault tolerant, such as HDFS.

Once the directory for the checkpoint is set, any DStream can be checkpointed into it, based on an interval. Revisiting the Twitter example, each DStream can be checkpointed every 10 seconds:

// Create a StreamingContext with a 5-second batch interval from the existing SparkContext.
val ssc = new StreamingContext(sc, Seconds(5))
// Receive live tweets and split each tweet's text into words.
val twitterStream = TwitterUtils.createStream(ssc, None)
val wordStream = twitterStream.flatMap(x => x.getText().split(" "))
val aggStream...
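A minimal sketch of how the checkpointing itself could be wired in (our own example; the checkpoint directory path is hypothetical):

// Directory on a fault-tolerant filesystem such as HDFS (path is illustrative).
ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")
// Checkpoint the word DStream every 10 seconds, as described above.
wordStream.checkpoint(Seconds(10))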

Driver failure recovery


We can achieve driver failure recovery with the help of StreamingContext.getOrCreate(). As previously mentioned, this will either initialize a StreamingContext from a checkpoint that already exists, or create a new one. 

We will now implement a function called createStreamContext(), which creates a StreamingContext and sets up DStreams to interpret tweets and generate the top five most-used hashtags using a 15-second window. Instead of invoking createStreamContext() and then calling ssc.start(), we will invoke getOrCreate(), so that if a checkpoint exists, the StreamingContext will be recreated from the data in the checkpoint directory. If there is no such directory, or if the application is on its first run, then createStreamContext() will be invoked:

val ssc = StreamingContext.getOrCreate(checkpointDirectory,
  createStreamContext _)

The following code shows how the function is defined, and how getOrCreate() can be invoked:

val checkpointDirectory = "checkpoints...
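As a hedged sketch of what such a function could look like (our own reconstruction, not the book's exact listing; it assumes an existing SparkContext sc, the checkpointDirectory defined above, and the spark-streaming-twitter package):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

def createStreamContext(): StreamingContext = {
  val ssc = new StreamingContext(sc, Seconds(5))
  ssc.checkpoint(checkpointDirectory)
  // Split tweets into words and keep only hashtags.
  val hashTags = TwitterUtils.createStream(ssc, None)
    .flatMap(_.getText.split(" "))
    .filter(_.startsWith("#"))
  // Count hashtags over a 15-second window and print the top five per batch.
  hashTags.map(tag => (tag, 1))
    .reduceByKeyAndWindow(_ + _, Seconds(15))
    .foreachRDD { rdd =>
      rdd.sortBy(_._2, ascending = false).take(5).foreach(println)
    }
  ssc
}

// Recreate the context from the checkpoint if one exists; otherwise build it afresh.
val ssc = StreamingContext.getOrCreate(checkpointDirectory, createStreamContext _)
ssc.start()
ssc.awaitTermination()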

Interoperability with streaming platforms (Apache Kafka)


Spark Streaming integrates well with Apache Kafka, currently the most popular messaging platform. There are several approaches to this integration, and the mechanism has improved over time with regard to performance and reliability.

There are three main approaches:

  • Receiver-based approach
  • Direct Stream approach
  • Structured Streaming

Receiver-based

The first integration between Spark and Kafka was the receiver-based integration. In the receiver-based approach, the driver starts receivers on the executors, which pull data from the Kafka brokers using Kafka's high-level consumer API. As events are pulled from the brokers, the receivers update the offsets in ZooKeeper, which is also used by the Kafka cluster. The important aspect here is the use of the write-ahead log (WAL), which the receiver writes to as it collects data from Kafka. If there is a problem and the executors and receivers are lost or have to restart, the WAL can...
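A minimal sketch of the receiver-based API (our own example, assuming a StreamingContext ssc and the spark-streaming-kafka-0-8 package; the ZooKeeper quorum, consumer group, and topic names are illustrative):

import org.apache.spark.streaming.kafka.KafkaUtils

// Topic name -> number of receiver threads for that topic.
val topics = Map("tweets" -> 1)
val kafkaStream = KafkaUtils.createStream(
  ssc,                      // existing StreamingContext
  "zkhost1:2181",           // ZooKeeper quorum (illustrative)
  "spark-streaming-group",  // consumer group id (illustrative)
  topics)

// Each record is a (key, message) pair; keep just the message text.
val messages = kafkaStream.map(_._2)
messages.print()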

Handling event time and late data


Event time is the time embedded in the data itself. Spark Streaming defines time as the receive time for DStream purposes, but for many applications that need the event time, this is not enough. For example, if you want the number of times a hashtag appears in tweets every minute, then you need the time when the data was generated, not the time when Spark received the event.

The following is an extension of the previous Structured Streaming example, listening on server port 9999. The timestamp is now included as part of the input data, so we can perform window operations on the unbounded table:

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Create a DataFrame representing the stream of input lines from the connection to host:port.
val inputLines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .option("includeTimestamp", true)
  .load()...
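A hedged sketch of the windowed aggregation that could follow (our own continuation, not the book's exact code): split each line into words, keep the timestamp added by includeTimestamp, and count words per 1-minute window.

import spark.implicits._

// Split lines into (word, timestamp) pairs, retaining the timestamp for each word.
val words = inputLines.as[(String, Timestamp)]
  .flatMap { case (line, ts) => line.split(" ").map(word => (word, ts)) }
  .toDF("word", "timestamp")

// Count words per 1-minute window on the timestamp column, sliding every 30 seconds.
val windowedCounts = words
  .groupBy(window($"timestamp", "1 minute", "30 seconds"), $"word")
  .count()

// Write the running windowed counts to the console.
val query = windowedCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
query.awaitTermination()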

Fault-tolerance semantics


Achieving exactly-once semantics is complicated in traditional streaming systems, which use an external database or storage to maintain offsets. Structured Streaming is still evolving, and has several challenges to overcome before it sees widespread use.

Summary


Over the course of this chapter, we covered the concepts of stream-processing systems, Spark Streaming, DStreams in Apache Spark, DAGs and DStream lineages, and transformations and actions. We also covered window-based stream processing and a practical example of processing Twitter tweets using Spark Streaming. We then looked at the receiver-based and direct-stream approaches to consuming data from Kafka, and finally at the newly developing technology of Structured Streaming, which aims to solve many current challenges, such as fault tolerance and exactly-once semantics in the stream, and to simplify integration with messaging systems such as Kafka, while maintaining the flexibility and extensibility to integrate with other input stream types.

In the next chapter, we will explore Apache Flink, which is a key challenger to Spark as a computing platform.

 
