
You're reading from Apache Spark 2.x Cookbook
1st Edition, published by Packt in May 2017
Reading level: Intermediate
ISBN-13: 9781787127265
Author: Rishi Yadav

Rishi Yadav has 19 years of experience in designing and developing enterprise applications. He is an open source software expert and advises American companies on big data and public cloud trends. Rishi was honored as one of Silicon Valley's 40 under 40 in 2014. He earned his bachelor's degree from the prestigious Indian Institute of Technology, Delhi, in 1998. About 12 years ago, Rishi started InfoObjects, a company that helps data-driven businesses gain new insights into data. InfoObjects combines the power of open source and big data to solve business challenges for its clients and has a special focus on Apache Spark. The company has been on the Inc. 5000 list of the fastest-growing companies for 6 years in a row. InfoObjects was also named the best place to work in the Bay Area in 2014 and 2015. Rishi is an open source contributor and active blogger.

This book is dedicated to my parents, Ganesh and Bhagwati Yadav; I would not be where I am without their unconditional support, trust, and providing me the freedom to choose a path of my own. Special thanks go to my life partner, Anjali, for providing immense support and putting up with my long, arduous hours (yet again). Our 9-year-old son, Vedant, and niece, Kashmira, were the unrelenting force behind keeping me and the book on track. Big thanks to InfoObjects' CTO and my business partner, Sudhir Jangir, for providing valuable feedback and also contributing recipes on enterprise security, a topic he is passionate about; to our SVP, Bart Hickenlooper, for taking charge in leading the company to the next level; to Tanmoy Chowdhury and Neeraj Gupta for their valuable advice; to Yogesh Chandani, Animesh Chauhan, and Katie Nelson for running operations skillfully so that I could focus on this book; and to our internal review team (especially Rakesh Chandran) for ironing out the kinks. I would also like to thank Marcel Izumi for, as always, providing creative visuals. I cannot miss thanking our dog, Sparky, for keeping me company on my long nights out. Last but not least, special thanks to our valuable clients, partners, and employees, who have made InfoObjects the best place to work at and, needless to say, an immensely successful organization.

Chapter 5. Spark Streaming

Spark Streaming adds the holy grail of big data processing, real-time analytics, to Apache Spark. It enables Spark to ingest live data streams and provide real-time intelligence at latencies as low as a few seconds.

In this chapter, we are going to cover the following recipes:

  • WordCount using Structured Streaming
  • Taking a closer look at Structured Streaming
  • Streaming Twitter data
  • Streaming using Kafka
  • Understanding streaming challenges

Introduction


Streaming is the process of dividing continuously flowing input data into discrete units so that it can be processed easily. Familiar real-life examples are streaming video and audio content: though a user could download an entire movie before watching it, a faster approach is to stream the data in small chunks that begin playing immediately while the rest downloads in the background.

Real-world examples of streaming, besides multimedia, include the processing of market feeds, weather data, and electronic stock trading data. All these applications produce large volumes of data at very high rates and require special handling so that insights can be derived from the data in real time.

Streaming has a few basic concepts that are worth discussing before we focus on Spark Streaming. The rate at which a streaming application receives data is called the data rate, and it is expressed in kilobytes per second (KB/s) or megabytes per...

WordCount using Structured Streaming


Let's start with a simple example of streaming: we will type text in one terminal, and the streaming application will capture and count it in another window.

How to do it...

  1. Start the Spark shell:
$ spark-shell
  2. Create a DataFrame to read what's coming on port 8585:
scala> val lines = spark.readStream.format("socket").option("host","localhost").option("port",8585).load()
  3. Cast the lines from a DataFrame to a Dataset of String and then flatten each line into words:
scala> val words = lines.as[String].flatMap(_.split(" "))
  4. Do the word count:
scala> val wordCounts = words.groupBy("value").count()
  5. Start the netcat server in a separate window:
$ nc -lk 8585
  6. Come back to the previous terminal and print the complete set of counts to the console every time it is updated:
scala> val query = wordCounts.writeStream.outputMode("complete").format("console").start()
  7. Now go back to the terminal where you started netcat and enter different lines, such as to be or not to be...
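
For reference, the same word count can be assembled into a minimal standalone application. The following is a sketch assuming a Spark 2.x build with the spark-sql module on the classpath; the SocketWordCount object name is our own choice, not from this recipe. Start nc -lk 8585 first, then run the application with spark-submit:

import org.apache.spark.sql.SparkSession

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    // A local two-thread master is enough for experimentation
    val spark = SparkSession.builder
      .appName("SocketWordCount")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Unbounded DataFrame backed by the socket source
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 8585)
      .load()

    // Split each line into words and count occurrences of each word
    val words = lines.as[String].flatMap(_.split(" "))
    val wordCounts = words.groupBy("value").count()

    // Emit the full, updated counts table to the console on every trigger
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}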

Taking a closer look at Structured Streaming


Structured Streaming has come up in various places in this chapter, but let's use this recipe to discuss it in more detail. Structured Streaming is essentially a stream-processing engine built on top of the Spark SQL engine.

An alternative way to look at streaming data is to think of it as an infinite/unbounded table that gets continuously appended as new data arrives.

The four fundamental concepts in Structured Streaming are:

  • Input table: The input data stream viewed as a continuously growing, unbounded table
  • Trigger: How often the result table gets updated with new input
  • Result table: The final table after every trigger update
  • Output: The part of the result table that is written to external storage after every trigger

A query may be interested in only the newly appended data (since the last trigger), in all the data that has been updated (which naturally includes appended rows), or in the whole table. This leads to the three output modes in Structured Streaming, contrasted in the sketch after this list:

  • Append
  • Update
  • Complete
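
To make the modes concrete, here is a hedged sketch that reuses the wordCounts aggregation and the lines DataFrame from the previous recipe; run one query at a time when experimenting, as each console query prints to the same terminal:

// Complete: rewrite the entire result table on every trigger
// (natural for aggregations such as wordCounts)
val completeQuery = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

// Update: emit only the result rows that changed since the last trigger
val updateQuery = wordCounts.writeStream
  .outputMode("update")
  .format("console")
  .start()

// Append: emit only newly appended rows; valid for queries whose existing
// result rows never change, such as a plain projection of the input
val appendQuery = lines.writeStream
  .outputMode("append")
  .format("console")
  .start()

// The trigger interval can also be set explicitly; Trigger.ProcessingTime
// is the Spark 2.2+ spelling (earlier 2.x releases used ProcessingTime)
import org.apache.spark.sql.streaming.Trigger
val timedQuery = wordCounts.writeStream
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .outputMode("complete")
  .format("console")
  .start()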

The DataFrame/Dataset API that is used for bounded tables...

Streaming Twitter data


Twitter is a famous microblogging platform that produces a massive amount of data, with around 500 million tweets sent each day. Twitter exposes its data through APIs, and that makes it the poster child for testing any big data streaming application.

In this recipe, we will see how to stream live data into Spark using Twitter streaming libraries. Twitter is just one of many sources of streaming data for Spark and has no special status, so there is no built-in Twitter library in Spark itself. Spark does, however, provide APIs that facilitate integration with external Twitter libraries.

An example use of a live Twitter data feed can be to find trending tweets in the last 5 minutes.

How to do it...

  1. Create a Twitter account if you do not already have one.
  2. Go to http://apps.twitter.com.
  3. Click on Create New App.
  4. Fill out the Name, Description, Website, and Callback URL fields and then click on Create your Twitter Application. You will receive a confirmation screen.
  5. You will reach...
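
The remaining steps are cut off above, but once you have a consumer key/secret and access token/secret, a hedged sketch of the trending-tweets use case might look like the following. It assumes the external spark-streaming-twitter integration, which for Spark 2.x is maintained in Apache Bahir (for example, --packages org.apache.bahir:spark-streaming-twitter_2.11:2.1.0), and that the four twitter4j.oauth.* system properties are set to your credentials; the TrendingHashtags object name is our own:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object TrendingHashtags {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TrendingHashtags").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Passing None makes the receiver build its OAuth credentials
    // from the twitter4j.oauth.* system properties
    val tweets = TwitterUtils.createStream(ssc, None)

    // Count hashtags over a sliding 5-minute window that advances every 10 seconds
    val hashtags = tweets.flatMap(_.getText.split(" ")).filter(_.startsWith("#"))
    val counts = hashtags.map((_, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(300), Seconds(10))

    // Print the ten most frequent hashtags for each window
    counts.foreachRDD { rdd =>
      rdd.sortBy(_._2, ascending = false).take(10).foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}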

Streaming using Kafka


Kafka is a distributed, partitioned, and replicated commit log service. In simple terms, it is a distributed messaging server. Kafka maintains message feeds in categories called topics. An example of a topic could be the ticker symbol of a company you would like to get news about, for example, CSCO for Cisco.

Processes that produce messages are called producers, and those that consume messages are called consumers. In traditional messaging, the service has one central messaging server, also called the broker. Since Kafka is a distributed messaging service, it runs a cluster of brokers, which together function as a single logical broker.

For each topic, Kafka maintains a partitioned log, which consists of one or more partitions spread across the cluster.
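
To make the partition concept concrete, this is how such a topic might be created with Kafka's standard CLI; this sketch assumes a 0.10-era Kafka installation with its ZooKeeper running at localhost:2181:

$ kafka-topics.sh --create --zookeeper localhost:2181 \
    --replication-factor 1 --partitions 3 --topic csco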

Kafka borrows a lot of concepts from Hadoop and other big data frameworks. The concept of partition is very similar to the concept of InputSplit in Hadoop...
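
Although the rest of this discussion is truncated here, a hedged sketch of consuming a Kafka topic with Structured Streaming might look as follows. It assumes Spark 2.x with the spark-sql-kafka-0-10 package on the classpath (for example, spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0), a broker at localhost:9092, and reuses the illustrative csco topic:

// Subscribe to the csco topic; Kafka records arrive with binary
// key/value columns plus topic/partition/offset metadata
val messages = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "csco")
  .load()

// Cast the binary columns to strings for inspection
val news = messages.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

// Stream the messages to the console as they arrive
val query = news.writeStream
  .outputMode("append")
  .format("console")
  .start()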

Understanding streaming challenges


There are certain challenges that every streaming application faces. In this recipe, we will develop an understanding of these challenges.

Late arriving/out-of-order data

If one streaming challenge had to be elected the leader, it would be late-arriving data. This issue is so specific to streaming that people who are not very familiar with streaming are often surprised by how prevalent it is.

There are two notions of time in streaming:

  • Event time: This is the time when an event actually happened, for example, a temperature measurement taken on a drive to an industrial site. Almost always, the event carries this time as part of its record.
  • Processing time: This is the time measured by the program that processes the event. For example, if a time-series IoT event is processed in the cloud, the processing time is when the event reaches the component doing the processing (such as Kinesis).

In stream-processing applications, this time lag between the event time...
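
The rest of this discussion is cut off here, but a minimal sketch of how Structured Streaming lets you aggregate on event time while tolerating late data, using watermarks (available from Spark 2.1 onward), might look as follows. The events streaming DataFrame and its (deviceId, temperature, eventTime) schema are illustrative assumptions, not from this book:

import org.apache.spark.sql.functions.{avg, window}
import spark.implicits._  // spark is the pre-created SparkSession in the shell

// Average temperature per device over 5-minute event-time windows,
// accepting records that arrive up to 10 minutes late
val windowedAvg = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window($"eventTime", "5 minutes"), $"deviceId")
  .agg(avg($"temperature"))

// Update mode emits only the windows whose aggregate changed in a trigger
val query = windowedAvg.writeStream
  .outputMode("update")
  .format("console")
  .start()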

