Building Data Streaming Applications with Apache Kafka

Product type Book
Published in Aug 2017
Publisher Packt
ISBN-13 9781787283985
Pages 278 pages
Edition 1st Edition
Authors (2): Chanchal Singh, Manish Kumar

Table of Contents (14 chapters)

  • Preface
  • Introduction to Messaging Systems
  • Introducing Kafka the Distributed Messaging Platform
  • Deep Dive into Kafka Producers
  • Deep Dive into Kafka Consumers
  • Building Spark Streaming Applications with Kafka
  • Building Storm Applications with Kafka
  • Using Kafka with Confluent Platform
  • Building ETL Pipelines Using Kafka
  • Building Streaming Applications Using Kafka Streams
  • Kafka Cluster Deployment
  • Using Kafka in Big Data Applications
  • Securing Kafka
  • Streaming Application Design Considerations

Building Storm Applications with Kafka

In the previous chapter, we learned about Apache Spark, a near real-time processing engine that processes data in micro-batches. But for very low-latency applications, where even a few seconds of delay can cause serious trouble, Spark may not be a good fit. Such applications need a framework that can handle millions of records per second and process them record by record, rather than in batches, to achieve lower latency. In this chapter, we will learn about the real-time processing engine Apache Storm. Storm was first designed and developed at Twitter and later became an open source Apache project.

In this chapter, we will learn about:

  • Introduction to Apache Storm
  • Apache Storm architecture
  • Brief overview of Apache Heron
  • Integrating Apache Storm with Apache Kafka (Java/Scala example)
  • Use case (log processing)
...

Introduction to Apache Storm

Apache Storm is used to handle latency-sensitive applications where even a delay of one second can mean huge losses. Many companies use Storm for fraud detection, building recommendation engines, triggering alerts on suspicious activity, and so on. Storm itself is stateless; it uses ZooKeeper for coordination and for maintaining important metadata.

Apache Storm is a distributed real-time processing framework that processes a single event at a time, and it can handle millions of records per second per node. The streaming data can be bounded or unbounded; in both cases, Storm can process it reliably.
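Storm's programming model boils down to spouts emitting tuples and bolts processing each tuple as it arrives. As a dependency-free sketch (these interfaces are simplified stand-ins, not the actual Storm API), the record-at-a-time contract looks like:

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Simplified stand-ins for Storm's contract: a spout emits tuples one at a
// time, and each bolt processes every tuple immediately, with no batching.
public class TupleAtATimeDemo {
    interface Spout { String nextTuple(); }  // returns null when no data left
    interface Bolt  { void execute(String tuple); }

    static String run() {
        Queue<String> source = new ArrayDeque<>(List.of("kafka", "storm", "kafka"));
        Spout spout = source::poll;          // emit one record at a time
        StringBuilder out = new StringBuilder();
        Bolt bolt = tuple -> out.append(tuple).append(' ');

        // The "topology loop": each tuple flows through the bolt as soon as
        // it is emitted, which is what keeps per-record latency low.
        for (String t = spout.nextTuple(); t != null; t = spout.nextTuple()) {
            bolt.execute(t);
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In real Storm, the spout and bolts run as parallel tasks across worker processes, but the per-tuple flow is the same.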

Storm cluster architecture

Storm also...

Introduction to Apache Heron

Apache Heron is the backward-compatible successor to Apache Storm, offering more power in terms of throughput, latency, and processing capability. As use cases at Twitter grew, Twitter felt the need for a new stream processing engine because of the following Storm bottlenecks:

  • Debugging: Twitter faced challenges in debugging due to code errors, hardware failures, and so on. Root causes were very difficult to detect because there was no clear mapping from a logical unit of computation to the physical processes running it.
  • Scale on demand: Storm requires dedicated cluster resources, with separate hardware to run each Storm topology. This prevents Storm from using cluster resources efficiently, limits its ability to scale on demand, and restricts sharing of cluster resources across different processing engines...

Integrating Apache Kafka with Apache Storm - Java

Now that we are familiar with the Storm topology concept, let us look at how to integrate Apache Storm with Apache Kafka. Kafka is very widely used with Storm in production applications. Let us look at the different APIs available for integration:

  • KafkaSpout: A spout in Storm is responsible for consuming data from a source system and passing it to bolts for further processing. KafkaSpout is specifically designed to consume data from Kafka as a stream and pass it to bolts. KafkaSpout accepts a SpoutConfig, which contains information about ZooKeeper, the Kafka brokers, and the topic to connect to.

Look at the following code:

SpoutConfig spoutConfig = new SpoutConfig(hosts, inputTopic, "/" + zkRootDir, consumerGroup);
spoutConfig.scheme = new SchemeAsMultiScheme...
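Downstream of KafkaSpout, the word-count topology wires a split bolt and a count bolt. Stripped of the Storm API, the core logic of those two bolts (split each incoming line into words, then keep a running count per word) can be sketched in plain Java:

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of the split-bolt and count-bolt logic from the classic
// Storm word-count topology; no Storm API is involved here.
public class WordCountBolts {
    private final Map<String, Integer> counts = new HashMap<>();

    // Split-bolt step: break a line into words.
    // Count-bolt step: increment the running count for each word.
    public void execute(String line) {
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    public Map<String, Integer> getCounts() {
        return counts;
    }

    public static void main(String[] args) {
        WordCountBolts wc = new WordCountBolts();
        wc.execute("kafka storm kafka");
        System.out.println(wc.getCounts().get("kafka")); // prints 2
    }
}
```

In the actual topology, this state would live inside a bolt's `execute` method, with Storm handling tuple delivery, acking, and parallelism.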

Integrating Apache Kafka with Apache Storm - Scala

This section contains the Scala version of the wordcount program discussed previously.
Topology Class: Let us try the topology class with Scala:

import org.apache.storm.Config
import org.apache.storm.LocalCluster
import org.apache.storm.StormSubmitter
import org.apache.storm.kafka._
import org.apache.storm.spout.SchemeAsMultiScheme
import org.apache.storm.topology.TopologyBuilder

object KafkaStormWordCountTopology {

  def main(args: Array[String]): Unit = {
    val zkConnString: String = "localhost:2181"
    val topic: String = "words"
    val hosts: BrokerHosts = new ZkHosts(zkConnString)
    val kafkaSpoutConfig: SpoutConfig =
      new SpoutConfig(hosts, topic, "/" + topic, "wordcountID")
    kafkaSpoutConfig.startOffsetTime = kafka.api.OffsetRequest.EarliestTime()
    kafkaSpoutConfig...

Use case – log processing in Storm, Kafka, Hive

We will use the same IP fraud detection use case that we used in Chapter 5, Building Spark Streaming Applications with Kafka. Let us begin with the code and how it works. Copy the following classes from Chapter 5 into your Storm-Kafka use case:

pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>

<groupId>com.packt</groupId>
<artifactId>chapter6</artifactId>
<version>1.0-SNAPSHOT</version>

<properties...
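The truncated pom.xml will also need the Storm and Kafka integration dependencies for the storm-kafka API used above. A plausible dependency block is sketched below; the version numbers are assumptions and should be matched to your cluster:

```xml
<dependencies>
  <!-- Storm core; provided scope because the cluster supplies it at runtime -->
  <dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>storm-core</artifactId>
    <version>1.0.1</version>
    <scope>provided</scope>
  </dependency>
  <!-- storm-kafka provides KafkaSpout, SpoutConfig, and ZkHosts -->
  <dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>storm-kafka</artifactId>
    <version>1.0.1</version>
  </dependency>
  <!-- Kafka broker/client libraries -->
  <dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.11</artifactId>
    <version>0.10.0.0</version>
  </dependency>
</dependencies>
```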

Summary

In this chapter, we briefly covered the Apache Storm architecture and the limitations of Storm that motivated Twitter to develop Heron. We also discussed the Heron architecture and its components. Later, we walked through the APIs and an example of Storm-Kafka integration, covered the IP fraud detection use case, and learned how to create a topology.
In the next chapter, we will learn about the Confluent Platform for Apache Kafka, which provides many advanced tools and features that we can use with Kafka.
