Chapter 8. Integration of Storm and Kafka

Apache Kafka is a high-throughput, distributed, fault-tolerant, and replicated messaging system that was first developed at LinkedIn. The use cases of Kafka range from log aggregation to stream processing to replacing other messaging systems.

Kafka has emerged as one of the important components of real-time processing pipelines in combination with Storm. Kafka can act as a buffer or feeder for messages that need to be processed by Storm. Kafka can also be used as the output sink for results emitted from Storm topologies.

In this chapter, we will be covering the following topics:

  • Kafka architecture: broker, producer, and consumer
  • Installation of a Kafka cluster
  • Sharing ZooKeeper between Storm and Kafka
  • Writing a Kafka producer and publishing data into Kafka
  • Development of a Storm topology using the Kafka consumer as the Storm spout
  • Deployment of a Kafka and Storm integration topology

Introduction to Kafka


In this section, we are going to cover the architecture of Kafka: broker, consumer, and producer.

Kafka architecture


Kafka has an architecture that differs significantly from other messaging systems. Kafka is a peer-to-peer system (each node in a cluster has the same role) in which each node is called a broker. The brokers coordinate their actions with the help of a ZooKeeper ensemble. The Kafka metadata managed by the ZooKeeper ensemble is described in the section Share ZooKeeper between Storm and Kafka:

Figure 8.1: A Kafka cluster

The following are the important components of Kafka:

Producer

A producer is an entity that uses the Kafka client API to publish messages into the Kafka cluster. In a Kafka broker, messages are published by producers to named entities called topics. A topic is a persistent queue (data stored in a topic is persisted to disk).

For parallelism, a Kafka topic can have multiple partitions. Each partition's data is stored in a separate file. Also, two partitions of a single topic can be allocated on different brokers, thus increasing throughput as all...
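As an illustration, a partitioned topic can be created with the kafka-topics.sh script that ships with Kafka. The following sketch assumes a local ZooKeeper on the default port and creates the new_topic topic used later in this chapter with three partitions:

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 3 --topic new_topic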

Installation of Kafka brokers


At the time of writing, the stable version of Kafka is 0.9.x.

The prerequisites for running Kafka are a ZooKeeper ensemble and Java version 1.7 or above. Kafka comes with a convenience script that can start a single-node ZooKeeper instance, but it is not recommended for use in a production environment. We will be using the ZooKeeper cluster we deployed in Chapter 2, Storm Deployment, Topology Development, and Topology Options.

We will first see how to set up a single-node Kafka cluster, and then how to add two more nodes to it to run a full-fledged, three-node Kafka cluster with replication enabled.

Setting up a single node Kafka cluster

Following are the steps to set up a single-node Kafka cluster:

  1. Download the Kafka 0.9.x binary distribution named kafka_2.10-0.9.0.1.tgz from http://apache.claz.org/kafka/0.9.0.1/kafka_2.10-0.9.0.1.tgz.
  2. Extract the archive to wherever you want to install Kafka with the following commands:
tar -xvzf kafka_2.10-0.9.0.1.tgz
cd kafka_2.10-0.9...
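Each broker is configured through config/server.properties inside the extracted directory. As a minimal sketch for a local single-node setup (the log directory and ZooKeeper address here are assumptions, not values from the book), the important entries look like this:

# Unique, non-negative ID of this broker
broker.id=0
# Port the broker listens on
port=9092
# Directory where the topic partition data is persisted
log.dirs=/tmp/kafka-logs
# ZooKeeper ensemble the broker registers itself with
zookeeper.connect=localhost:2181

The broker can then be started with bin/kafka-server-start.sh config/server.properties.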

Share ZooKeeper between Storm and Kafka


We can share the same ZooKeeper ensemble between Kafka and Storm, as both store their metadata inside different znodes (ZooKeeper coordinates between distributed processes using a shared hierarchical namespace, which is organized similarly to a standard file system; in ZooKeeper, the data registers that make up this namespace are called znodes).

We need to open the ZooKeeper client console to view the znodes (shared namespace) created for Kafka and Storm.

Go to ZK_HOME and execute the following command to open the ZooKeeper console:

> bin/zkCli.sh

Execute the following command to view the list of znodes:

> [zk: localhost:2181(CONNECTED) 0] ls /
[storm, consumers, isr_change_notification, zookeeper, admin, brokers]

Here, consumers, isr_change_notification, and brokers are the znodes under which Kafka manages its metadata in ZooKeeper.

Storm manages its metadata inside the Storm znodes in ZooKeeper.
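To drill further into the Kafka metadata, you can list the children of the brokers znode. The exact layout varies by Kafka version; on a 0.9.x installation it typically contains ids (the registered broker IDs) and topics (per-topic partition metadata):

> [zk: localhost:2181(CONNECTED) 1] ls /brokers
[ids, topics, seqid]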

Kafka producers and publishing data into Kafka


In this section, we will write a Kafka producer that publishes events into a Kafka topic.

Perform the following steps to create the producer:

  1. Create a Maven project by using com.stormadvance as groupId and kafka-producer as artifactId.
  2. Add the following dependencies for Kafka in the pom.xml file:
<dependency> 
  <groupId>org.apache.kafka</groupId> 
  <artifactId>kafka_2.10</artifactId> 
  <version>0.9.0.1</version> 
  <exclusions> 
    <exclusion> 
      <groupId>com.sun.jdmk</groupId> 
      <artifactId>jmxtools</artifactId> 
    </exclusion> 
    <exclusion> 
      <groupId>com.sun.jmx</groupId> 
      <artifactId>jmxri</artifactId> 
    </exclusion> 
  </exclusions> 
</dependency> 
<dependency> 
  <groupId>org.apache.logging.log4j</groupId> 
  <artifactId>log4j-slf4j-impl</artifactId...
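With the dependencies in place, the producer can be written against the producer API that ships with Kafka 0.9. The following is a minimal sketch of such a producer; the class name matches the KafkaSampleProducer referred to later in this chapter, but the broker address and sample sentence are assumptions, and the book's actual implementation may differ:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaSampleProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Broker(s) to bootstrap from; assumes a local single-node broker
    props.put("bootstrap.servers", "localhost:9092");
    // Serialize both the message key and value as plain strings
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    // Wait for the partition leader to acknowledge each write
    props.put("acks", "1");

    Producer<String, String> producer = new KafkaProducer<String, String>(props);
    // Publish each word of a sample sentence to the topic new_topic
    String sentence = "the quick brown fox jumps over the lazy dog";
    for (String word : sentence.split(" ")) {
      producer.send(new ProducerRecord<String, String>("new_topic", word));
    }
    producer.close();
  }
}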

Kafka Storm integration


Now we will create a Storm topology that will consume messages from the Kafka topic new_topic and aggregate words into sentences.

The complete message flow is as follows:

We have already seen KafkaSampleProducer, which produces words into the Kafka broker. Now we will create a Storm topology that reads those words from Kafka and aggregates them into sentences. For this, the application will have one KafkaSpout, which reads the messages from Kafka, and two bolts: WordBolt, which receives words from KafkaSpout and aggregates them into sentences, and SentenceBolt, which simply prints the sentences to the output stream. We will be running this topology in local mode.

Follow the steps to create the Storm topology:

  1. Create a new Maven project with groupId as com.stormadvance and artifactId as kafka-storm-topology.
  2. Add the following dependencies for Kafka-Storm and Storm in the pom.xml file:
<dependency> 
  <groupId>org.apache.storm...
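Once the dependencies are in place, the topology wiring might look like the following minimal sketch, built against the storm-kafka module's KafkaSpout. The ZooKeeper address and the /kafka_storm offset root are assumptions, and the bodies of the WordBolt and SentenceBolt described above are omitted:

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.topology.TopologyBuilder;

public class KafkaStormWordTopology {
  public static void main(String[] args) throws Exception {
    // Point the spout at the ZooKeeper ensemble Kafka registers its brokers in
    ZkHosts hosts = new ZkHosts("localhost:2181");
    // Read from new_topic; consumed offsets are tracked under /kafka_storm/StormSpout
    SpoutConfig spoutConfig = new SpoutConfig(hosts, "new_topic", "/kafka_storm", "StormSpout");
    // Emit each Kafka message as a single-field string tuple
    spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("KafkaSpout", new KafkaSpout(spoutConfig), 1);
    builder.setBolt("WordBolt", new WordBolt(), 1).shuffleGrouping("KafkaSpout");
    builder.setBolt("SentenceBolt", new SentenceBolt(), 1).shuffleGrouping("WordBolt");

    // Run in local mode, as described above
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("kafka-storm-topology", new Config(), builder.createTopology());
  }
}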

Deploy the Kafka topology on Storm cluster


The deployment of the Kafka and Storm integration topology on a Storm cluster is similar to the deployment of any other topology. We need to set the number of workers and the maximum spout pending in the Storm config, and use the submitTopology method of StormSubmitter to submit the topology to the Storm cluster.
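In code, that amounts to something like the following sketch; the worker count and max spout pending values are illustrative, and builder is the TopologyBuilder assembled in the previous section:

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;

// builder is the TopologyBuilder assembled in the previous section
Config conf = new Config();
conf.setNumWorkers(2);          // number of worker processes for this topology
conf.setMaxSpoutPending(1000);  // cap on un-acked tuples in flight per spout task
StormSubmitter.submitTopology("kafka-storm-topology", conf, builder.createTopology());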

Now, we need to build the topology code using the following steps to create a JAR of the Kafka Storm integration topology:

  1. Go to the project home directory.
  2. Execute the following command:
mvn clean install

The output of the preceding command is as follows:

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 58.326s
[INFO] Finished at:
[INFO] Final Memory: 14M/116M
[INFO] ------------------------------------------------------------------------
  3. Now, copy the Kafka...

Summary


In this chapter, we learned the basics of Apache Kafka and how to use it as part of a real-time stream processing pipeline built with Storm. We learned about the architecture of Apache Kafka and how it can be integrated with Storm processing by using KafkaSpout.

In the next chapter, we are going to cover the integration of Storm with Hadoop and YARN, along with sample examples of this integration.
