You're reading from Mastering Hadoop 3

Product type: Book
Published in: Feb 2019
Reading level: Expert
Publisher: Packt
ISBN-13: 9781788620444
Edition: 1st
Authors (2):

Chanchal Singh

Chanchal Singh has over half a decade of experience in product development and architecture design. He has worked closely with the leadership teams of various companies, including directors, CTOs, and founding members, to define their technical roadmaps. He is the founder of and a speaker at the meetup groups Big Data and AI Pune Meetup and Experience Speaks. He is a co-author of the book Building Data Streaming Applications with Apache Kafka. He has a Bachelor's degree in Information Technology from the University of Mumbai and a Master's degree in Computer Application from Amity University. He was also part of the Entrepreneur Cell at IIT Mumbai. His LinkedIn profile can be found under the username Chanchal Singh.

Manish Kumar

Manish Kumar works as Director of Technology and Architecture at VSquare. He has over 13 years' experience in providing technology solutions to complex business problems. He has worked extensively on web application development, IoT, big data, cloud technologies, and blockchain. Manish has co-authored three books: Mastering Hadoop 3, Artificial Intelligence for Big Data, and Building Streaming Applications with Apache Kafka.


Chapter 9. Real-Time Stream Processing in Hadoop

Industries of all kinds have started adopting big data technology, having seen the advantages that companies gain by integrating it into their existing business models. Traditionally, companies focused on batch jobs, and there has always been a lag of several minutes, or sometimes hours, between the arrival of data and it being displayed to the user. This leads to a delay in decision making, which in turn leads to revenue loss. This is where real-time analytics comes into the picture.

Real-time analytics is a methodology in which data is processed immediately after the system receives it, and the processed data becomes available for use right away. Spark Streaming helps in achieving this objective very efficiently. This chapter will cover a brief introduction to the following topics:

  • Spark Streaming
  • Integration of Apache Kafka with Apache Spark Streaming
  • Common stream data patterns
  • Streaming design considerations
  • Case studies

This chapter...

Technical requirements


You will be required to have basic knowledge of Linux and Apache Hadoop 3.0.

The code files of this chapter can be found on GitHub: https://github.com/PacktPublishing/Mastering-Hadoop-3/tree/master/Chapter09

Check out the following video to see the code in action: http://bit.ly/2T3yYfz

What are streaming datasets?


Streaming datasets are about performing data processing not on bounded data, but on unbounded data. Typical datasets are bounded, which means they are complete; at the very least, you process the data as if it were complete. Realistically, we know that there will always be new data, but as far as the processing is concerned, we treat the dataset as if it were complete. With bounded data, processing is done in phases, and one phase does not start until the preceding phase has finished. Another way to think about bounded data processing is that we will be done analyzing the data before new data comes in. Bounded datasets are finite in size. The following diagram represents how bounded data is processed using a typical MapReduce batch processing engine:

On the other hand, an unbounded dataset (also known as an infinite dataset) is never complete; there is always new data coming in, and typically, data is coming...

Stream data ingestion


Data ingestion is the mechanism by which data is moved from a particular type of source to destination storage, where it can be used for further advanced analytics. With very large data volumes, data is generally streamed to the destination storage, provided that both the source and destination systems are capable of handling continuous streams of data. Stream data ingestion comes in two types: event-based ingestion and ingestion through message queues.

Flume event-based data ingestion

Flume is a highly available, distributed system that is used for streaming data ingestion. It collects, aggregates, and processes streaming data on the fly and stores it on disk for reliability. The following diagram shows the Flume architecture:

The preceding diagram represents the components of the Flume architecture, which are detailed in the following points; a minimal agent configuration follows the list:

  • Flume Sources: These consume events from external sources and pass them...
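
To ground these components, here is a minimal single-agent Flume configuration sketch; the agent name, file paths, and NameNode address are illustrative assumptions. It wires an exec source that tails a log file through a disk-backed file channel into an HDFS sink:

# Name this agent's components (all names are illustrative)
agent1.sources = logSource
agent1.channels = fileChannel
agent1.sinks = hdfsSink

# Source: tail an application log via the exec source
agent1.sources.logSource.type = exec
agent1.sources.logSource.command = tail -F /var/log/app/access.log
agent1.sources.logSource.channels = fileChannel

# Channel: buffer events on disk for reliability
agent1.channels.fileChannel.type = file
agent1.channels.fileChannel.checkpointDir = /var/flume/checkpoint
agent1.channels.fileChannel.dataDirs = /var/flume/data

# Sink: write events out to HDFS
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode:8020/data/streams/
agent1.sinks.hdfsSink.channel = fileChannel

Such an agent would then be started with something like flume-ng agent --name agent1 --conf-file agent1.conf.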

Common stream data processing patterns


In this section, we will talk about various processing patterns for unbounded data. Unbounded data patterns differ from those for bounded, fixed-size data: with every new record that arrives in a stream, the context in which older records were processed changes. Stream processing is therefore continuous, and its results are only true at a given point in time. In this section, we will cover some of the patterns common to any type of stream processing. Let's look at them one by one.

Unbounded data batch processing

You can always process unbounded data in batch mode by slicing, or converting, the unbounded data into bounded data. A common technique for doing this is called windowing, or tumbling windowing. In this process, unbounded data is repeatedly processed in windows of fixed length, usually separated by a time interval. The following diagram shows batch stream processing windowing, and a minimal code sketch follows it:
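
The following is a hedged sketch of a tumbling window using Spark Streaming's Java DStream API. The socket test source on localhost:9999, the 5-second batch interval, and the 60-second window are illustrative assumptions; the key point is that the window length equals the slide interval, so consecutive windows never overlap:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class TumblingWindowSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("TumblingWindow");
        // Micro-batches are formed every 5 seconds
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Unbounded input: lines arriving on a test socket (assumed source)
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // Tumbling window: window length == slide interval, so each record
        // falls into exactly one 60-second window
        JavaDStream<Long> countsPerWindow =
            lines.window(Durations.seconds(60), Durations.seconds(60)).count();

        countsPerWindow.print();
        jssc.start();
        jssc.awaitTermination();
    }
}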

Another approach to batch unbounded data processing is using sessions. The following diagram represents how...

Streaming design considerations


Streaming is an important pillar for large-scale organizations. More and more organizations rely on massive data pools and need faster actionable insights. You should understand the long-lasting impact on profitability of timely data and of appropriate actions based on such timely insights. In addition to enabling timely action, streaming opens up channels for capturing massive, unbounded data from various business groups throughout an organization. This section focuses on the factors that should be taken into account when designing a streaming application. The end result of such a design is driven by the business objectives of the organization.

Latency

The processing of incoming data from several sources and the production of immediate results is one of the fundamental features of any streaming application. The initial considerations for this are latency and throughput. In other words, latency and throughput are measured...

Micro-batch processing case study


This section covers a small case study in which Kafka and Spark Streaming are used to detect fraudulent IPs that have attempted to hit a server many times. We will cover the following use cases:

  • Producer: The Kafka producer API will be used to read a log file and publish its records to a Kafka topic. In a real case, however, we could use Flume, or a producer application that reads records in real time and publishes them directly to Kafka.
  • Fraud IP list: We will keep a predefined list of fraudulent IPs that is used to identify fraud. This application uses an in-memory IP list, which could be substituted with a fast key-based lookup store, such as HBase.
  • Spark Streaming: A Spark Streaming application reads records from Kafka and detects suspicious IPs and domains; a sketch of such a consumer follows this list.
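
The following is a minimal, hedged sketch of such a consumer using the spark-streaming-kafka-0-10 direct stream in Java. The topic name server-logs, the broker address, the sample fraud IPs, and the log format (client IP as the first whitespace-separated token) are all assumptions made for illustration:

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class IpFraudDetector {
    // Assumed in-memory fraud list; as noted above, this could be replaced by HBase lookups
    private static final Set<String> FRAUD_IPS =
        new HashSet<>(Arrays.asList("203.0.113.10", "198.51.100.7"));

    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("IpFraudDetector");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "ip-fraud-group");
        kafkaParams.put("auto.offset.reset", "latest");

        // Direct stream from the assumed server-logs topic
        JavaInputDStream<ConsumerRecord<String, String>> stream =
            KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                    Collections.singletonList("server-logs"), kafkaParams));

        // Keep only log lines whose leading token (the client IP) is on the fraud list
        JavaDStream<String> suspicious = stream
            .map(ConsumerRecord::value)
            .filter(line -> FRAUD_IPS.contains(line.split(" ")[0]));

        suspicious.print();
        jssc.start();
        jssc.awaitTermination();
    }
}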

Maven is a tool for building and managing projects, and we will build this project using Maven. Eclipse or IntelliJ IDEA is recommended for creating the project. Add the following adjustments and plugins to your pom.xml...
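
As a hedged sketch of the dependencies such a pom.xml needs (the artifact coordinates are standard, but the versions shown are illustrative and should be aligned with your cluster's Spark and Kafka versions):

<dependencies>
    <!-- Kafka producer/consumer client API -->
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>1.1.0</version>
    </dependency>
    <!-- Spark Streaming core (DStream API) -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>2.3.2</version>
    </dependency>
    <!-- Kafka 0.10+ direct-stream integration for Spark Streaming -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
        <version>2.3.2</version>
    </dependency>
</dependencies>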

Real-time processing case study


In this section, we will use Apache Storm to process the same logs for the IP fraud detection case mentioned in the preceding section. Apache Storm is used for highly latency-sensitive applications, in which even a one-second delay can cause enormous losses. Many enterprises use Storm to detect fraud, power recommendation engines, trigger alerts on suspicious activity, and so on. It uses ZooKeeper for coordination purposes and for maintaining important metadata. Apache Storm is stateless; it is a distributed real-time processing framework that handles one event at a time and can process millions of records per second per node. Streaming data can be bounded or unbounded, and Storm can reliably process it in both situations. The Maven pom.xml for the application begins as follows:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http...
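
Building on that pom, the following is a minimal, hedged Storm topology sketch in Java (assuming Storm 1.x's org.apache.storm API). The sample spout that replays a hardcoded log line and the in-memory fraud-IP set are illustrative stand-ins for a real Kafka spout and a real lookup store:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class IpFraudTopology {

    // Stand-in spout: replays one hardcoded log line per second;
    // a real topology would read from Kafka or a log collector instead
    public static class LogSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(1000);
            collector.emit(new Values("203.0.113.10 - - GET /index.html 200"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("line"));
        }
    }

    // Bolt: emits the client IP of any event whose leading token is on the fraud list
    public static class FraudDetectorBolt extends BaseBasicBolt {
        private static final Set<String> FRAUD_IPS =
            new HashSet<>(Arrays.asList("203.0.113.10", "198.51.100.7"));

        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String ip = input.getStringByField("line").split(" ")[0];
            if (FRAUD_IPS.contains(ip)) {
                collector.emit(new Values(ip));
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("fraudIp"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("log-spout", new LogSpout(), 1);
        builder.setBolt("fraud-detector", new FraudDetectorBolt(), 2)
               .shuffleGrouping("log-spout");

        // Local mode for testing; StormSubmitter.submitTopology would be used on a cluster
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("ip-fraud", new Config(), builder.createTopology());
    }
}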

Summary


In this chapter, we learned some of the basics of stream processing, including stream data ingestion and some common stream processing patterns. We also looked at micro-batch stream processing using Spark Streaming and real-time processing using the Storm processing engine.

In the next chapter, we will learn about machine learning in Hadoop.
