
You're reading from Fast Data Processing Systems with SMACK Stack

Published in Dec 2016 by Packt, 1st edition, ISBN-13 9781786467201. Reading level: Intermediate.
Author: Raúl Estrada

Raúl Estrada has been a programmer since 1996 and a Java developer since 2001. He loves all topics related to computer science. With more than 15 years of experience in high-availability and enterprise software, he has been designing and implementing architectures since 2003. His specialization is in systems integration, and he mainly participates in projects related to the financial sector. He has been an enterprise architect for BEA Systems and Oracle Inc., but he also enjoys web, mobile, and game programming. Raúl is a supporter of free software and enjoys experimenting with new technologies, frameworks, languages, and methods. Raúl is the author of other Packt Publishing titles, such as Fast Data Processing Systems with SMACK and Apache Kafka Cookbook.

Chapter 5. The Broker - Apache Kafka

The aim of this chapter is to familiarize you with Apache Kafka and to show how it handles the consumption of millions of messages in a pipeline architecture. We present some Scala examples to give you a foundation for the different types of implementation and integration of Kafka producers and consumers.

In addition to explaining the Apache Kafka architecture and principles, we'll explore Kafka's integration with the rest of the SMACK stack, specifically with Spark. At the end of the chapter, we will learn how to administer Apache Kafka.

This chapter has the following sections:

  • Introducing Kafka
  • Installation
  • Cluster
  • Architecture
  • Producers
  • Consumers
  • Integration
  • Administration

Introducing Kafka


Jay Kreps, the creator of Apache Kafka, says this about the Kafka name:

I thought that since Kafka was a system optimized for writing using a writer's name would make sense. I had taken a lot of lit classes in college and liked Franz Kafka. Plus the name sounded cool for an open source project.

So basically there is not much of a relationship.

Apache Kafka is mainly optimized for writing (in this book, when we say optimized, we mean two million writes per second on a commodity cluster).

Nowadays, real-time information is continuously generated; this data needs easy ways to be delivered to multiple receivers. Most of the time, generators and consumers of information are inaccessible to each other, and here is when integration tools are required.

In the eighties, nineties, and two thousands, the large software vendors (IBM, SAP, BEA, Oracle, Microsoft, Google, and so on) found a very lucrative market in the integration layer. Here we can find enterprise service buses, SOA architectures...

Installation


Go to the Apache Kafka downloads page, kafka.apache.org/downloads, shown in Figure 5-2, Apache Kafka download page.

Figure 5-2. Apache Kafka download page

The current stable Apache Kafka release is 0.10.0.0. A major limitation since 0.8.x is that Kafka is not backward-compatible, so we cannot replace this version with one prior to 0.8. Once you've downloaded the latest available release, let's proceed with the installation.

Installing Java

We need Java 1.7 or later. Download and install the latest JDK from Oracle's website: http://www.oracle.com/technetwork/java/javase/downloads/index.html .

To install in Linux (as an example):

  1. Change the file mode:
    [master@localhost opt]# chmod +x jdk-8u91-linux-x64.rpm 
    
  2. Go to the directory in which you want to perform the installation:
    [master@localhost opt]# cd <directory path name>
    
  3. Run the rpm installer with the command:
    [master@localhost java]# rpm -ivh jdk-8u91-linux-x64.rpm 
    
  4. Finally, add the environment variable...

Cluster


Now we are ready to program with the Apache Kafka publisher-subscriber messaging system. First, some terminology.

In Kafka, there are three types of clusters:

  • Single node - single broker
  • Single node - multiple broker
  • Multiple node - multiple broker

A Kafka cluster has five main actors:

  • Broker: The server. A Kafka cluster has one or more physical servers, and each one may have one or more server processes running. Each server process is called a broker. Topics live in the broker processes.
  • Topic: The queue. A topic is a category or feed name to which messages are published by the message producers. Topics are partitioned, and each partition is represented by an ordered, immutable sequence of messages. The cluster keeps a partitioned log for each topic. Each message in a partition has a unique sequential ID called an offset.
  • Producer: Producers publish data to topics by choosing the appropriate partition within the topic. To achieve load balancing, the allocation of messages to topic partitions can be done...
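The terminology above can be made concrete with a small model. The following is an illustrative sketch only, not Kafka's actual implementation: a topic holds numbered partitions, each partition is an append-only log, and a message's offset is simply its position in that log.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only: a topic is a set of partitions; each partition
// is an ordered, immutable sequence of messages identified by offset.
class MiniTopic {
    private final Map<Integer, List<String>> partitions = new HashMap<>();

    MiniTopic(int numPartitions) {
        for (int p = 0; p < numPartitions; p++) {
            partitions.put(p, new ArrayList<>());
        }
    }

    // Appending returns the message's offset: its index in the partition log.
    long append(int partition, String message) {
        List<String> log = partitions.get(partition);
        log.add(message);
        return log.size() - 1;
    }

    String read(int partition, long offset) {
        return partitions.get(partition).get((int) offset);
    }
}

public class TopicDemo {
    public static void main(String[] args) {
        MiniTopic topic = new MiniTopic(3);
        long o0 = topic.append(0, "first");
        long o1 = topic.append(0, "second");
        System.out.println(o0 + " " + o1);    // offsets are sequential per partition
        System.out.println(topic.read(0, 1)); // prints "second"
    }
}
```

Note that offsets are per partition, not per topic: two partitions of the same topic each start their own sequence at zero.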

Architecture


Kafka was created at LinkedIn. To start with, LinkedIn used the Java Message Service (JMS), but when they needed more power, that is, a scalable architecture, the LinkedIn development team decided to build the project that today we know as Kafka. In 2011, Kafka was open sourced as an Apache project. Due to size constraints, in this section we'll leave the reader with some reflections on why the architecture is designed the way it is.

Figure 5-6, A topic with 3 partitions, shows a topic with three partitions. In it we can see the five Kafka components: Zookeeper, broker, topic, producer, and consumer.

Figure 5-6. A topic with 3 partitions

The Kafka project goals are:

  • An API: To support the custom implementation of producers and consumers
  • Low overhead: Low network latency and low storage overhead with message persistence on disk
  • High throughput: Publishing and subscribing of millions of messages, supporting data feeds in real time
  • Distributed: Highly scalable architecture to handle low...

Producers


Producers are applications that create messages and publish them to the broker. Typical producers are frontend applications, web pages, web services, backend services, proxies, and adapters to legacy systems. We can write Kafka producers in Java, Scala, C, and Python.

The process begins when the producer connects to any live node and requests metadata about the partition leaders of a topic, so that it can send each message directly to the lead broker of its partition.
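The routing step above can be sketched as follows. This is an illustrative simplification: Kafka's default partitioner uses murmur2 hashing of the key bytes, while here we use plain `hashCode()` only to show the idea that the same key always lands on the same partition.

```java
// Illustrative sketch of key-based partition selection (not Kafka's
// actual partitioner, which uses murmur2 hashing of the key bytes).
public class PartitionDemo {
    static int partitionFor(String key, int numPartitions) {
        // Mask the sign bit so the result is a valid partition index.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int p1 = partitionFor("user-42", 3);
        int p2 = partitionFor("user-42", 3);
        System.out.println(p1 == p2); // prints "true": one key -> one partition
    }
}
```

This stickiness is what gives Kafka per-key ordering: all messages with the same key go to the same partition, and each partition is an ordered log.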

Producer API

First, we need to understand the classes needed to write a producer:

  • Producer: The producer class is KafkaProducer<K, V> in org.apache.kafka.clients.producer.KafkaProducer

KafkaProducer is a generic class: K specifies the type of the partition key and V specifies the type of the message value.

  • ProducerRecord: The class is ProducerRecord <K, V> in org.apache.kafka.clients.producer.ProducerRecord

This class encapsulates the data to be sent to the broker: the destination topic name, an optional partition number, and the key and value of the message...
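To see why the two generic parameters matter, here is a self-contained sketch of the shape of such a record. This is not the real ProducerRecord class, only a hypothetical illustration of a generic <K, V> container holding a topic name, a key, and a value:

```java
// Illustration only: a simplified generic record mirroring the idea of
// ProducerRecord<K, V> (topic + partition key + message value).
class MiniRecord<K, V> {
    final String topic;
    final K key;
    final V value;

    MiniRecord(String topic, K key, V value) {
        this.topic = topic;
        this.key = key;
        this.value = value;
    }
}

public class RecordDemo {
    public static void main(String[] args) {
        // K = String (routing key), V = String (payload); "page-views"
        // and "user-42" are made-up example names.
        MiniRecord<String, String> rec =
            new MiniRecord<>("page-views", "user-42", "home.html");
        System.out.println(rec.topic + " " + rec.key + " " + rec.value);
    }
}
```

In practice K and V are whatever types your configured key and value serializers can turn into bytes.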

Consumers


Consumers are applications that consume the messages published by the broker. Typical consumers are real-time analytics applications, near real-time analytics applications, NoSQL solutions, data warehouses, backend services, or subscriber-based solutions. We can write Kafka consumers in the JVM languages (Java, Groovy, Scala, Clojure, Ceylon), Python, and C/C++.

The consumer subscribes to a specific topic on the Kafka broker to consume its messages. The consumer then makes a fetch request to the lead broker of a partition, specifying the message offset to consume from. The consumer works in a pull model and always pulls all available messages from its current position.
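The pull model can be sketched with a toy in-memory log. This is an illustration of the offset-tracking idea, not the Kafka consumer API: the consumer remembers its own position, and each fetch returns everything from that position to the end of the log.

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration of the pull model: the consumer keeps its own offset
// and each fetch pulls all messages from that position onward.
public class PullDemo {
    public static void main(String[] args) {
        List<String> partitionLog = Arrays.asList("m0", "m1", "m2", "m3");
        int position = 1; // the consumer's current offset

        // Fetch: everything from the current position to the log's end.
        List<String> fetched = partitionLog.subList(position, partitionLog.size());
        position = partitionLog.size(); // advance past what was consumed

        System.out.println(fetched);  // prints "[m1, m2, m3]"
        System.out.println(position); // prints "4"
    }
}
```

Because the consumer, not the broker, tracks the position, a consumer can rewind to an earlier offset and replay messages.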

Consumer API

In Kafka version 0.8.0 there were two API types for consumers: the high-level API and the low-level API. In version 0.10.0 they were unified.

To use the consumer API with Maven, we should use these coordinates (the version shown matches the 0.10.0.0 release discussed above):

<dependency> 
  <groupId>org.apache.kafka</groupId> 
  <artifactId>kafka-clients</artifactId> 
  <version>0.10.0.0</version> 
</dependency>

Integration


Processing small amounts of data in real time is not a challenge when we use the Java Message Service (JMS), but, if we learn from the LinkedIn experience, we will see that these processing systems have serious performance limitations when dealing with large data volumes. Moreover, these systems are a nightmare when we try to scale horizontally, because they simply were not designed to.

Integration with Apache Spark

For this demo, we need a Kafka cluster up and running. Also, we need Spark installed on our machine and ready to be deployed.

Apache Spark has a utility class to create a data stream to be read from Kafka. As with any Spark project, we first need to create the SparkConf and the Spark streaming context:

val sparkConf = new SparkConf().setAppName("SparkKafkaTest") 
val jssc = new JavaStreamingContext(sparkConf, Durations.seconds(10)) 

The JavaStreamingContext is a Java-friendly version of StreamingContext, which is the main entry point for Spark Streaming functionality.
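The batch duration passed to the streaming context (Durations.seconds(10) above) controls micro-batching: Spark Streaming groups incoming records into fixed-length time windows and processes each window as a small batch. The grouping idea can be illustrated with a toy, self-contained sketch (this is not Spark code; the timestamps are made-up example values):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy illustration of micro-batching: events carry timestamps (ms) and
// are grouped into fixed-duration batches, as Spark Streaming does.
public class BatchDemo {
    public static void main(String[] args) {
        long batchMillis = 10_000; // mirrors Durations.seconds(10)
        long[] eventTimes = {1000, 4000, 12000, 15000, 21000};

        // Batch index = timestamp / batch duration.
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long t : eventTimes) {
            batches.computeIfAbsent(t / batchMillis, k -> new ArrayList<>()).add(t);
        }
        System.out.println(batches); // prints "{0=[1000, 4000], 1=[12000, 15000], 2=[21000]}"
    }
}
```

A shorter batch duration lowers latency but raises scheduling overhead; ten seconds is simply the value used in this demo.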

We create the Hashset...

Administration


Kafka provides numerous tools to manage features such as cluster management, topic administration, and cluster mirroring. Let's see some of these tools.

Cluster tools

When replicating multiple partitions, we obtain several replicas, of which one acts as the leader and the rest as followers. When the leader is unavailable, a follower takes over the leadership.

When we have to shut down a broker for maintenance activities, new leaders are elected sequentially. This brings significant I/O load on Zookeeper; with a big cluster, this means delays in service.

To achieve high availability, Kafka provides a tool for shutting down brokers gracefully. This tool transfers the leadership among the replicas or to another broker before the shutdown. If no in-sync replica is available, the tool refuses to shut down the broker, to ensure data integrity.

This tool is used with this command:

[master@localhost kafka_2.10-0.10.0.0]# bin/kafka-run-class.sh kafka.admin.ShutdownBroker --zookeeper <zookeeper_host:port...
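Besides the command-line tool, leadership migration on shutdown can also be requested through broker configuration. The sketch below shows the relevant property in the broker's server.properties; its default value varies between Kafka versions, so check the documentation for the release you run:

```
# server.properties: ask the broker to migrate partition leadership
# away from itself before completing a shutdown
controlled.shutdown.enable=true
```

With this enabled, a normal broker stop attempts the same leadership handoff that the tool performs explicitly.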

Summary


This was a complete review of Apache Kafka, and we have touched upon many important facts about it. We learned why Kafka was developed, how to install it, and its support for different types of clusters. We also explored Kafka's design approach and wrote a few basic producers and consumers.

In this chapter, we learned how to set up a Kafka cluster with single and multiple brokers on a single node, run producers and consumers from the command line, and exchange some messages. We also discussed important settings about the broker. Finally, we discussed Kafka's integration with technologies such as Spark.

In the next chapter, we will review some integration patterns with examples. Also see Chapter 7, Study Case 1 - Spark and Cassandra, for an in-depth example of Kafka integration with the other technologies.
