Building Data Streaming Applications with Apache Kafka

Product type: Book
Published in: Aug 2017
Publisher: Packt
ISBN-13: 9781787283985
Pages: 278
Edition: 1st Edition
Authors (2): Chanchal Singh, Manish Kumar

Table of Contents (14 chapters)

  • Preface
  • Introduction to Messaging Systems
  • Introducing Kafka the Distributed Messaging Platform
  • Deep Dive into Kafka Producers
  • Deep Dive into Kafka Consumers
  • Building Spark Streaming Applications with Kafka
  • Building Storm Applications with Kafka
  • Using Kafka with Confluent Platform
  • Building ETL Pipelines Using Kafka
  • Building Streaming Applications Using Kafka Streams
  • Kafka Cluster Deployment
  • Using Kafka in Big Data Applications
  • Securing Kafka
  • Streaming Application Design Considerations

Building Spark Streaming Applications with Kafka

We have gone through all the components of Apache Kafka and the different APIs that can be used to develop applications on top of it. In the previous chapters, we learned about Kafka producers, brokers, and consumers, and the best practices for using Kafka as a messaging system.

In this chapter, we will cover Apache Spark, a distributed in-memory processing engine. We will then walk through Spark Streaming concepts and how to integrate Apache Kafka with Spark.

In short, we will cover the following topics:

  • Introduction to Spark
  • Internals of Spark such as RDD
  • Spark Streaming
  • Receiver-based approach (Spark-Kafka integration)
  • Direct approach (Spark-Kafka integration)
  • Use case (Log processing)

Introduction to Spark 

Apache Spark is a distributed, in-memory data processing system. It provides a rich set of APIs in Java, Scala, and Python. The Spark APIs can be used to develop applications that perform batch and real-time data processing and analytics, machine learning, and graph processing over huge volumes of data, all on a single cluster platform.

Spark development started in 2009 with a team at Berkeley's AMPLab that aimed to improve on the performance of the MapReduce framework.

MapReduce is another distributed batch processing framework, developed at Yahoo based on Google's MapReduce research paper.

What they found was that applications which take an iterative approach to solving certain problems can be improved by reducing disk I/O. Spark allows us to cache a large dataset in memory, so applications that apply transformations iteratively can use the benefit of caching to...
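The benefit described above can be shown with a toy sketch in plain Java (this is not Spark code, just an analogy): an iterative algorithm that re-reads its input on every pass pays the "disk" cost each time, while one that loads the dataset into memory once pays it only once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Toy illustration (not Spark itself): iterating over a dataset that is
// re-loaded every pass versus a dataset cached in memory once.
public class CachingSketch {

    // Simulate an expensive disk read (like an HDFS scan between
    // MapReduce jobs).
    static List<Integer> loadFromDisk() {
        try { TimeUnit.MILLISECONDS.sleep(50); } catch (InterruptedException e) { }
        List<Integer> data = new ArrayList<>();
        for (int i = 1; i <= 1000; i++) data.add(i);
        return data;
    }

    static long sum(List<Integer> data) {
        long s = 0;
        for (int v : data) s += v;
        return s;
    }

    public static void main(String[] args) {
        int iterations = 10;

        long start = System.nanoTime();
        long uncachedSum = 0;
        for (int i = 0; i < iterations; i++) {
            uncachedSum = sum(loadFromDisk());   // re-read on every iteration
        }
        long uncachedMs = (System.nanoTime() - start) / 1_000_000;

        start = System.nanoTime();
        List<Integer> cached = loadFromDisk();    // load once, keep in memory
        long cachedSum = 0;
        for (int i = 0; i < iterations; i++) {
            cachedSum = sum(cached);              // every iteration hits memory
        }
        long cachedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("uncached sum=" + uncachedSum + " took ~" + uncachedMs + " ms");
        System.out.println("cached   sum=" + cachedSum + " took ~" + cachedMs + " ms");
    }
}
```

The cached loop pays the simulated I/O cost once instead of ten times, which is exactly the gain Spark's in-memory caching gives iterative workloads.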

Spark Streaming 

Spark Streaming is built on top of the Spark core engine and can be used to develop fast, scalable, high-throughput, and fault-tolerant real-time systems. Streaming data can come from any source, such as production logs, click-stream data, Kafka, Kinesis, Flume, and many other data-serving systems.
Spark Streaming provides an API to receive this data and apply complex algorithms on top of it to get business value out of it. Finally, the processed data can be written to any storage system. We will talk more about Spark Streaming integration with Kafka in this section.

Basically, there are two approaches to integrating Kafka with Spark, and we will go into detail on each:

  • Receiver-based approach
  • Direct approach

The receiver-based approach is the older way of doing the integration. The direct API provides several advantages over the receiver-based approach...
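The key idea behind the direct approach can be sketched without any Spark or Kafka classes: instead of a long-running receiver pushing records into executor memory, the driver computes an explicit [from, until) offset range per partition for each micro-batch, and each batch reads exactly that range. The class below is a stdlib-only simulation of that bookkeeping, not Spark's actual API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stdlib-only sketch of the direct approach's core idea: each micro-batch
// is an explicit [from, until) offset range per partition, so no receiver
// or write-ahead log is needed and no record is consumed twice.
public class DirectApproachSketch {

    // Pretend each list is a Kafka partition; the list index is the offset.
    static Map<Integer, List<String>> partitions = new HashMap<>();

    // Offsets already processed, per partition (in real Spark this is what
    // the application checkpoints or stores itself).
    static Map<Integer, Integer> committed = new HashMap<>();

    static List<String> nextBatch() {
        List<String> batch = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> p : partitions.entrySet()) {
            int from = committed.getOrDefault(p.getKey(), 0);
            int until = p.getValue().size();          // latest available offset
            batch.addAll(p.getValue().subList(from, until));
            committed.put(p.getKey(), until);         // advance after reading
        }
        return batch;
    }

    public static void main(String[] args) {
        partitions.put(0, new ArrayList<>(List.of("a", "b", "c")));
        partitions.put(1, new ArrayList<>(List.of("x", "y")));

        System.out.println("batch 1: " + nextBatch()); // all five records
        partitions.get(0).add("d");                    // new data arrives
        System.out.println("batch 2: " + nextBatch()); // only the new record
    }
}
```

Because the offset ranges are tracked by the application rather than by ZooKeeper and a receiver, a failed batch can simply be re-read from the same range, which is what gives the direct approach its exactly-once semantics.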

Use case: log processing - fraud IP detection

This section covers a small use case that uses Kafka and Spark Streaming to detect fraudulent IPs and count the number of times each IP tried to hit the server. The use case involves the following components:

  • Producer: We will use the Kafka Producer API to read a log file and publish records to a Kafka topic. In a real application, however, we would use Flume or a producer application that takes log records in real time and publishes them to the Kafka topic.
  • Fraud IP list: We will maintain a list of predefined fraudulent IP ranges that can be used to identify fraud IPs. This application uses an in-memory IP list, which could be replaced by a fast key-based lookup service such as HBase.
  • Spark Streaming: The Spark Streaming application will read records from the Kafka topic and detect suspicious IPs and domains.
...
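The in-memory fraud IP list mentioned above can be sketched as a small Java class (the class and method names here are illustrative, not the book's code): each fraudulent range is stored as a CIDR block, and a lookup checks whether the masked address matches any stored network.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal in-memory fraud-IP lookup. In production this flat list could be
// replaced by a fast key-based store such as HBase, as noted above.
public class FraudIpLookup {

    // Each fraud range is a CIDR block stored as {network, mask}.
    private final List<long[]> ranges = new ArrayList<>();

    // Convert a dotted-quad IPv4 address to a 32-bit value.
    static long toLong(String ip) {
        String[] o = ip.split("\\.");
        return (Long.parseLong(o[0]) << 24) | (Long.parseLong(o[1]) << 16)
             | (Long.parseLong(o[2]) << 8)  |  Long.parseLong(o[3]);
    }

    // Register a fraudulent range, e.g. "64.242.0.0/16".
    public void addCidr(String cidr) {
        String[] parts = cidr.split("/");
        int prefix = Integer.parseInt(parts[1]);
        long mask = prefix == 0 ? 0 : (0xFFFFFFFFL << (32 - prefix)) & 0xFFFFFFFFL;
        ranges.add(new long[]{ toLong(parts[0]) & mask, mask });
    }

    // True if the IP falls inside any registered fraud range.
    public boolean isFraud(String ip) {
        long addr = toLong(ip);
        for (long[] r : ranges) {
            if ((addr & r[1]) == r[0]) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        FraudIpLookup lookup = new FraudIpLookup();
        lookup.addCidr("64.242.0.0/16");
        lookup.addCidr("10.0.0.0/8");
        System.out.println(lookup.isFraud("64.242.88.10")); // true
        System.out.println(lookup.isFraud("8.8.8.8"));      // false
    }
}
```

The Spark Streaming job would call `isFraud` on the first column of each record it pulls from the Kafka topic.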

Producer 

You can use IntelliJ or Eclipse to build the producer application. The producer reads a log file taken from an Apache project, which contains detailed records like:

64.242.88.10 - - [08/Mar/2004:07:54:30 -0800] "GET /twiki/bin/edit/Main/Unknown_local_recipient_reject_code?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846

You can have just one record in the test file; the producer will generate records by substituting random IPs for the existing one, so we will have millions of distinct records with unique IP addresses.

Record columns are separated by spaces, which we change to commas in the producer. The first column represents the IP address or domain name, which will be used to detect whether the request came from a fraudulent client. The following is the Java Kafka producer that reads the log file.
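The record-generation step described above can be sketched as follows (class and method names are mine, not the book's, and splitting on every space is a simplification: the timestamp and request fields also contain spaces, which stricter parsing would have to respect):

```java
import java.util.Random;

// Sketch of the producer's record generation: take one template log line,
// swap in a random IP for the first column, and turn the space delimiters
// into commas before publishing.
public class LogRecordGenerator {

    private static final Random random = new Random();

    static String randomIp() {
        return random.nextInt(256) + "." + random.nextInt(256) + "."
             + random.nextInt(256) + "." + random.nextInt(256);
    }

    // Replace the first (IP) column and switch the delimiter to commas.
    static String toCsvRecord(String logLine) {
        String[] columns = logLine.split(" ");
        columns[0] = randomIp();
        return String.join(",", columns);
    }

    public static void main(String[] args) {
        String template = "64.242.88.10 - - [08/Mar/2004:07:54:30 -0800] "
            + "\"GET /twiki/bin/edit/Main/Unknown_local_recipient_reject_code"
            + "?topicparent=Main.ConfigurationVariables HTTP/1.1\" 401 12846";
        for (int i = 0; i < 3; i++) {
            String record = toCsvRecord(template);
            System.out.println(record);
            // In the real producer each record would then be published, e.g.:
            // producer.send(new ProducerRecord<>(topicName, record));
        }
    }
}
```

Repeating this over the single template line is what produces the millions of distinct records with unique IP addresses mentioned above.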

...

Summary

In this chapter, we learned about Apache Spark, its architecture, and the Spark ecosystem in brief. Our focus was on the different ways to integrate Kafka with Spark and their advantages and disadvantages. We also covered the APIs for the receiver-based approach and the direct approach. Finally, we walked through a small use case for IP fraud detection using a log file and a lookup service. You can now create your own Spark Streaming application.
In the next chapter, we will cover another real-time streaming framework, Apache Heron (the successor to Apache Storm). We will cover how Apache Heron differs from Apache Spark and when to use which.
