Building Data Streaming Applications with Apache Kafka

Product type Book
Published in Aug 2017
Publisher Packt
ISBN-13 9781787283985
Pages 278 pages
Edition 1st Edition
Authors (2): Chanchal Singh, Manish Kumar

Table of Contents (14 chapters)

  • Preface
  • Introduction to Messaging Systems
  • Introducing Kafka the Distributed Messaging Platform
  • Deep Dive into Kafka Producers
  • Deep Dive into Kafka Consumers
  • Building Spark Streaming Applications with Kafka
  • Building Storm Applications with Kafka
  • Using Kafka with Confluent Platform
  • Building ETL Pipelines Using Kafka
  • Building Streaming Applications Using Kafka Streams
  • Kafka Cluster Deployment
  • Using Kafka in Big Data Applications
  • Securing Kafka
  • Streaming Application Design Considerations

Building Storm Applications with Kafka

In the previous chapter, we learned about Apache Spark, a near real-time processing engine that processes data in micro-batches. But for very low-latency applications, where even a few seconds of delay can cause serious trouble, Spark may not be a good fit. Such applications need a framework that can handle millions of records per second and process them record by record, rather than in batches, to achieve lower latency. In this chapter, we will learn about the real-time processing engine Apache Storm. Storm was first designed and developed at Twitter and later became an open source Apache project.

In this chapter, we will learn about:

  • Introduction to Apache Storm
  • Apache Storm architecture
  • Brief overview of Apache Heron
  • Integrating Apache Storm with Apache Kafka (Java/Scala example)
  • Use case (log processing)
...

Introduction to Apache Storm

Apache Storm is used to handle latency-sensitive applications where even a delay of one second can mean huge losses. Many companies use Storm for fraud detection, building recommendation engines, triggering alerts on suspicious activity, and so on. Storm itself is stateless; it uses ZooKeeper for coordination and for maintaining important metadata.

Apache Storm is a distributed real-time processing framework that processes a single event at a time, and it can handle millions of records per second per node. The streaming data can be bounded or unbounded; in both cases, Storm can process it reliably.
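Storm's programming model boils down to spouts emitting tuples and bolts processing each tuple as it arrives. As a dependency-free sketch (these interfaces are simplified stand-ins, not the actual Storm API), the record-at-a-time contract looks like:

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Simplified stand-ins for Storm's contract: a spout emits tuples one at a
// time, and each bolt processes every tuple immediately, with no batching.
public class TupleAtATimeDemo {
    interface Spout { String nextTuple(); }  // returns null when no data left
    interface Bolt  { void execute(String tuple); }

    static String run() {
        Queue<String> source = new ArrayDeque<>(List.of("kafka", "storm", "kafka"));
        Spout spout = source::poll;          // emit one record at a time
        StringBuilder out = new StringBuilder();
        Bolt bolt = tuple -> out.append(tuple).append(' ');

        // The "topology loop": each tuple flows through the bolt as soon as
        // it is emitted, which is what keeps per-record latency low.
        for (String t = spout.nextTuple(); t != null; t = spout.nextTuple()) {
            bolt.execute(t);
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In real Storm, the spout and bolts run as parallel tasks across worker processes, but the per-tuple flow is the same.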

Storm cluster architecture

Storm also...

Introduction to Apache Heron

Apache Heron is the backward-compatible successor to Apache Storm, offering more power in terms of throughput, latency, and processing capability. As use cases at Twitter grew, Twitter felt the need for a new stream processing engine because of the following Storm bottlenecks:

  • Debugging: Twitter faced challenges in debugging due to code errors, hardware failures, and so on. Root causes were very difficult to detect because there was no clear mapping from a logical unit of computation to the physical processes running it.
  • Scale on demand: Storm requires dedicated cluster resources, with separate hardware to run each Storm topology. This prevents Storm from using cluster resources efficiently, limits its ability to scale on demand, and restricts sharing of cluster resources across different processing engines...

Integrating Apache Kafka with Apache Storm - Java

Now that we are familiar with the Storm topology concept, let us look at how to integrate Apache Storm with Apache Kafka. Kafka is very widely used with Storm in production applications. Let us look at the different APIs available for integration:

  • KafkaSpout: A spout in Storm is responsible for consuming data from a source system and passing it to bolts for further processing. KafkaSpout is specifically designed to consume data from Kafka as a stream and pass it to bolts. KafkaSpout accepts a SpoutConfig, which contains information about ZooKeeper, the Kafka brokers, and the topic to connect to.

Look at the following code:

SpoutConfig spoutConfig = new SpoutConfig(hosts, inputTopic, "/" + zkRootDir, consumerGroup);
spoutConfig.scheme = new SchemeAsMultiScheme...
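Downstream of KafkaSpout, the word-count topology wires a split bolt and a count bolt. Stripped of the Storm API, the core logic of those two bolts (split each incoming line into words, then keep a running count per word) can be sketched in plain Java:

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of the split-bolt and count-bolt logic from the classic
// Storm word-count topology; no Storm API is involved here.
public class WordCountBolts {
    private final Map<String, Integer> counts = new HashMap<>();

    // Split-bolt step: break a line into words.
    // Count-bolt step: increment the running count for each word.
    public void execute(String line) {
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    public Map<String, Integer> getCounts() {
        return counts;
    }

    public static void main(String[] args) {
        WordCountBolts wc = new WordCountBolts();
        wc.execute("kafka storm kafka");
        System.out.println(wc.getCounts().get("kafka")); // prints 2
    }
}
```

In the actual topology, this state would live inside a bolt's `execute` method, with Storm handling tuple delivery, acking, and parallelism.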

Integrating Apache Kafka with Apache Storm - Scala

This section contains the Scala version of the wordcount program discussed previously.
Topology Class: Let us try the topology class with Scala:

import org.apache.storm.Config
import org.apache.storm.LocalCluster
import org.apache.storm.StormSubmitter
import org.apache.storm.kafka._
import org.apache.storm.spout.SchemeAsMultiScheme
import org.apache.storm.topology.TopologyBuilder

object KafkaStormWordCountTopology {

  def main(args: Array[String]): Unit = {
    val zkConnString: String = "localhost:2181"
    val topic: String = "words"
    val hosts: BrokerHosts = new ZkHosts(zkConnString)
    val kafkaSpoutConfig: SpoutConfig =
      new SpoutConfig(hosts, topic, "/" + topic, "wordcountID")
    kafkaSpoutConfig.startOffsetTime = kafka.api.OffsetRequest.EarliestTime()
    kafkaSpoutConfig...

Use case – log processing in Storm, Kafka, Hive

We will use the same IP fraud detection use case that we used in Chapter 5, Building Spark Streaming Applications with Kafka. Let us begin with the code and how it works. Copy the following classes from Chapter 5 into your Storm-Kafka use case:

pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>

<groupId>com.packt</groupId>
<artifactId>chapter6</artifactId>
<version>1.0-SNAPSHOT</version>

<properties...
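The truncated pom.xml will also need the Storm and Kafka integration dependencies for the storm-kafka API used above. A plausible dependency block is sketched below; the version numbers are assumptions and should be matched to your cluster:

```xml
<dependencies>
  <!-- Storm core; provided scope because the cluster supplies it at runtime -->
  <dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>storm-core</artifactId>
    <version>1.0.1</version>
    <scope>provided</scope>
  </dependency>
  <!-- storm-kafka provides KafkaSpout, SpoutConfig, and ZkHosts -->
  <dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>storm-kafka</artifactId>
    <version>1.0.1</version>
  </dependency>
  <!-- Kafka broker/client libraries -->
  <dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.11</artifactId>
    <version>0.10.0.0</version>
  </dependency>
</dependencies>
```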

Summary

In this chapter, we briefly covered the Apache Storm architecture and the limitations of Storm that motivated Twitter to develop Heron. We also discussed the Heron architecture and its components. Later, we walked through the APIs and an example of Storm-Kafka integration, covered the IP fraud detection use case, and learned how to create a topology.
In the next chapter, we will learn about the Confluent Platform for Apache Kafka, which provides many advanced tools and features that we can use with Kafka.
