Building Data Streaming Applications with Apache Kafka

Product type: Book
Published in: Aug 2017
Publisher: Packt
ISBN-13: 9781787283985
Pages: 278
Edition: 1st Edition
Authors (2): Chanchal Singh, Manish Kumar

Table of Contents (14 chapters)

  • Preface
  • Introduction to Messaging Systems
  • Introducing Kafka the Distributed Messaging Platform
  • Deep Dive into Kafka Producers
  • Deep Dive into Kafka Consumers
  • Building Spark Streaming Applications with Kafka
  • Building Storm Applications with Kafka
  • Using Kafka with Confluent Platform
  • Building ETL Pipelines Using Kafka
  • Building Streaming Applications Using Kafka Streams
  • Kafka Cluster Deployment
  • Using Kafka in Big Data Applications
  • Securing Kafka
  • Streaming Application Design Considerations

Building Spark Streaming Applications with Kafka

We have gone through all the components of Apache Kafka and the different APIs that can be used to develop applications on top of it. In the previous chapters, we learned about Kafka producers, brokers, and consumers, and the best practices for using Kafka as a messaging system.

In this chapter, we will cover Apache Spark, a distributed in-memory processing engine. We will then walk through Spark Streaming concepts and how to integrate Apache Kafka with Spark.

In short, we will cover the following topics:

  • Introduction to Spark
  • Internals of Spark such as RDD
  • Spark Streaming
  • Receiver-based approach (Spark-Kafka integration)
  • Direct approach (Spark-Kafka integration)
  • Use case (Log processing)

Introduction to Spark 

Apache Spark is a distributed, in-memory data processing system. It provides a rich set of APIs in Java, Scala, and Python. The Spark APIs can be used to develop applications that perform batch and real-time data processing and analytics, machine learning, and graph processing over huge volumes of data, all on a single cluster platform.

Spark development started in 2009 with a team at Berkeley's AMPLab that aimed to improve on the performance of the MapReduce framework.

MapReduce is another distributed batch processing framework, developed at Yahoo based on Google's MapReduce research paper.

What they found was that applications which take an iterative approach to solving certain problems can be improved by reducing disk I/O. Spark allows us to cache a large dataset in memory, so applications that apply transformations iteratively can use the benefit of caching to...
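The benefit described above can be shown with a toy sketch in plain Java (this is not Spark code, just an analogy): an iterative algorithm that re-reads its input on every pass pays the "disk" cost each time, while one that loads the dataset into memory once pays it only once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Toy illustration (not Spark itself): iterating over a dataset that is
// re-loaded every pass versus a dataset cached in memory once.
public class CachingSketch {

    // Simulate an expensive disk read (like an HDFS scan between
    // MapReduce jobs).
    static List<Integer> loadFromDisk() {
        try { TimeUnit.MILLISECONDS.sleep(50); } catch (InterruptedException e) { }
        List<Integer> data = new ArrayList<>();
        for (int i = 1; i <= 1000; i++) data.add(i);
        return data;
    }

    static long sum(List<Integer> data) {
        long s = 0;
        for (int v : data) s += v;
        return s;
    }

    public static void main(String[] args) {
        int iterations = 10;

        long start = System.nanoTime();
        long uncachedSum = 0;
        for (int i = 0; i < iterations; i++) {
            uncachedSum = sum(loadFromDisk());   // re-read on every iteration
        }
        long uncachedMs = (System.nanoTime() - start) / 1_000_000;

        start = System.nanoTime();
        List<Integer> cached = loadFromDisk();    // load once, keep in memory
        long cachedSum = 0;
        for (int i = 0; i < iterations; i++) {
            cachedSum = sum(cached);              // every iteration hits memory
        }
        long cachedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("uncached sum=" + uncachedSum + " took ~" + uncachedMs + " ms");
        System.out.println("cached   sum=" + cachedSum + " took ~" + cachedMs + " ms");
    }
}
```

The cached loop pays the simulated I/O cost once instead of ten times, which is exactly the gain Spark's in-memory caching gives iterative workloads.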

Spark Streaming 

Spark Streaming is built on top of the Spark core engine and can be used to develop fast, scalable, high-throughput, and fault-tolerant real-time systems. Streaming data can come from any source, such as production logs, click-stream data, Kafka, Kinesis, Flume, and many other data-serving systems.
Spark Streaming provides an API to receive this data and apply complex algorithms on top of it to get business value out of it. Finally, the processed data can be written to any storage system. We will talk more about Spark Streaming integration with Kafka in this section.

Basically, there are two approaches to integrating Kafka with Spark, and we will go into detail on each:

  • Receiver-based approach
  • Direct approach

The receiver-based approach is the older way of doing the integration. The direct API provides several advantages over the receiver-based approach...
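The key idea behind the direct approach can be sketched without any Spark or Kafka classes: instead of a long-running receiver pushing records into executor memory, the driver computes an explicit [from, until) offset range per partition for each micro-batch, and each batch reads exactly that range. The class below is a stdlib-only simulation of that bookkeeping, not Spark's actual API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stdlib-only sketch of the direct approach's core idea: each micro-batch
// is an explicit [from, until) offset range per partition, so no receiver
// or write-ahead log is needed and no record is consumed twice.
public class DirectApproachSketch {

    // Pretend each list is a Kafka partition; the list index is the offset.
    static Map<Integer, List<String>> partitions = new HashMap<>();

    // Offsets already processed, per partition (in real Spark this is what
    // the application checkpoints or stores itself).
    static Map<Integer, Integer> committed = new HashMap<>();

    static List<String> nextBatch() {
        List<String> batch = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> p : partitions.entrySet()) {
            int from = committed.getOrDefault(p.getKey(), 0);
            int until = p.getValue().size();          // latest available offset
            batch.addAll(p.getValue().subList(from, until));
            committed.put(p.getKey(), until);         // advance after reading
        }
        return batch;
    }

    public static void main(String[] args) {
        partitions.put(0, new ArrayList<>(List.of("a", "b", "c")));
        partitions.put(1, new ArrayList<>(List.of("x", "y")));

        System.out.println("batch 1: " + nextBatch()); // all five records
        partitions.get(0).add("d");                    // new data arrives
        System.out.println("batch 2: " + nextBatch()); // only the new record
    }
}
```

Because the offset ranges are tracked by the application rather than by ZooKeeper and a receiver, a failed batch can simply be re-read from the same range, which is what gives the direct approach its exactly-once semantics.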

Use case: log processing - fraud IP detection

This section covers a small use case that uses Kafka and Spark Streaming to detect fraudulent IPs and count the number of times each IP tried to hit the server. The use case involves the following components:

  • Producer: We will use the Kafka Producer API to read a log file and publish records to a Kafka topic. In a real application, however, we would use Flume or a producer application that takes log records in real time and publishes them to the Kafka topic.
  • Fraud IP list: We will maintain a list of predefined fraudulent IP ranges that can be used to identify fraud IPs. This application uses an in-memory IP list, which could be replaced by a fast key-based lookup service such as HBase.
  • Spark Streaming: The Spark Streaming application will read records from the Kafka topic and detect suspicious IPs and domains.
...
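The in-memory fraud IP list mentioned above can be sketched as a small Java class (the class and method names here are illustrative, not the book's code): each fraudulent range is stored as a CIDR block, and a lookup checks whether the masked address matches any stored network.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal in-memory fraud-IP lookup. In production this flat list could be
// replaced by a fast key-based store such as HBase, as noted above.
public class FraudIpLookup {

    // Each fraud range is a CIDR block stored as {network, mask}.
    private final List<long[]> ranges = new ArrayList<>();

    // Convert a dotted-quad IPv4 address to a 32-bit value.
    static long toLong(String ip) {
        String[] o = ip.split("\\.");
        return (Long.parseLong(o[0]) << 24) | (Long.parseLong(o[1]) << 16)
             | (Long.parseLong(o[2]) << 8)  |  Long.parseLong(o[3]);
    }

    // Register a fraudulent range, e.g. "64.242.0.0/16".
    public void addCidr(String cidr) {
        String[] parts = cidr.split("/");
        int prefix = Integer.parseInt(parts[1]);
        long mask = prefix == 0 ? 0 : (0xFFFFFFFFL << (32 - prefix)) & 0xFFFFFFFFL;
        ranges.add(new long[]{ toLong(parts[0]) & mask, mask });
    }

    // True if the IP falls inside any registered fraud range.
    public boolean isFraud(String ip) {
        long addr = toLong(ip);
        for (long[] r : ranges) {
            if ((addr & r[1]) == r[0]) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        FraudIpLookup lookup = new FraudIpLookup();
        lookup.addCidr("64.242.0.0/16");
        lookup.addCidr("10.0.0.0/8");
        System.out.println(lookup.isFraud("64.242.88.10")); // true
        System.out.println(lookup.isFraud("8.8.8.8"));      // false
    }
}
```

The Spark Streaming job would call `isFraud` on the first column of each record it pulls from the Kafka topic.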

Producer 

You can use IntelliJ or Eclipse to build the producer application. The producer reads a log file taken from an Apache project, which contains detailed records like:

64.242.88.10 - - [08/Mar/2004:07:54:30 -0800] "GET /twiki/bin/edit/Main/Unknown_local_recipient_reject_code?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846

You can have just one record in the test file; the producer will generate records by substituting random IPs for the existing one, so we will have millions of distinct records with unique IP addresses.

Record columns are separated by spaces, which we change to commas in the producer. The first column represents the IP address or domain name, which will be used to detect whether the request came from a fraudulent client. The following is the Java Kafka producer that reads the log file.
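The record-generation step described above can be sketched as follows (class and method names are mine, not the book's, and splitting on every space is a simplification: the timestamp and request fields also contain spaces, which stricter parsing would have to respect):

```java
import java.util.Random;

// Sketch of the producer's record generation: take one template log line,
// swap in a random IP for the first column, and turn the space delimiters
// into commas before publishing.
public class LogRecordGenerator {

    private static final Random random = new Random();

    static String randomIp() {
        return random.nextInt(256) + "." + random.nextInt(256) + "."
             + random.nextInt(256) + "." + random.nextInt(256);
    }

    // Replace the first (IP) column and switch the delimiter to commas.
    static String toCsvRecord(String logLine) {
        String[] columns = logLine.split(" ");
        columns[0] = randomIp();
        return String.join(",", columns);
    }

    public static void main(String[] args) {
        String template = "64.242.88.10 - - [08/Mar/2004:07:54:30 -0800] "
            + "\"GET /twiki/bin/edit/Main/Unknown_local_recipient_reject_code"
            + "?topicparent=Main.ConfigurationVariables HTTP/1.1\" 401 12846";
        for (int i = 0; i < 3; i++) {
            String record = toCsvRecord(template);
            System.out.println(record);
            // In the real producer each record would then be published, e.g.:
            // producer.send(new ProducerRecord<>(topicName, record));
        }
    }
}
```

Repeating this over the single template line is what produces the millions of distinct records with unique IP addresses mentioned above.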

...

Summary

In this chapter, we learned about Apache Spark, its architecture, and the Spark ecosystem in brief. Our focus was on the different ways to integrate Kafka with Spark and their advantages and disadvantages. We also covered the APIs for the receiver-based approach and the direct approach. Finally, we walked through a small use case for IP fraud detection using a log file and a lookup service. You can now create your own Spark Streaming application.
In the next chapter, we will cover another real-time streaming framework, Apache Heron (the successor to Apache Storm). We will cover how Apache Heron differs from Apache Spark and when to use which.
