
You're reading from Fast Data Processing Systems with SMACK Stack

Published in Dec 2016 by Packt, 1st edition, ISBN-13 9781786467201. Reading level: Intermediate.
Author: Raúl Estrada

Raúl Estrada has been a programmer since 1996 and a Java developer since 2001. He loves all topics related to computer science. With more than 15 years of experience in high-availability and enterprise software, he has been designing and implementing architectures since 2003. His specialization is in systems integration, and he mainly participates in projects related to the financial sector. He has been an enterprise architect for BEA Systems and Oracle Inc., but he also enjoys web, mobile, and game programming. Raúl is a supporter of free software and enjoys experimenting with new technologies, frameworks, languages, and methods. Raúl is the author of other Packt Publishing titles, such as Fast Data Processing Systems with SMACK and Apache Kafka Cookbook.

Chapter 5. The Broker - Apache Kafka

The aim of this chapter is to familiarize you with Apache Kafka and to show how it handles the consumption of millions of messages in a pipeline architecture. We present some Scala examples to give you a foundation for the different types of implementation and integration of Kafka producers and consumers.

In addition to explaining the Apache Kafka architecture and principles, we'll explore Kafka's integration with the rest of the SMACK stack, specifically with Spark. At the end of the chapter, we will learn how to administer Apache Kafka.

This chapter has the following sections:

  • Introducing Kafka
  • Installation
  • Cluster
  • Architecture
  • Producers
  • Consumers
  • Integration
  • Administration

Introducing Kafka


Jay Kreps, the creator of Apache Kafka, says this about the Kafka name:

I thought that since Kafka was a system optimized for writing using a writer's name would make sense. I had taken a lot of lit classes in college and liked Franz Kafka. Plus the name sounded cool for an open source project.

So basically there is not much of a relationship.

Apache Kafka is mainly optimized for writing (in this book, when we say optimized, we mean two million writes per second on a commodity cluster).

Nowadays, real-time information is continuously generated; this data needs easy ways to be delivered to multiple receivers. Most of the time, generators and consumers of information are inaccessible to each other, and here is when integration tools are required.

In the eighties, nineties, and two thousands, the large software vendors (IBM, SAP, BEA, Oracle, Microsoft, Google, and so on) found a very lucrative market in the integration layer. Here we can find enterprise service buses, SOA architectures...

Installation


Go to the Apache Kafka downloads page, kafka.apache.org/downloads, shown in Figure 5-2, Apache Kafka download page.

Figure 5-2. Apache Kafka download page

The current stable Apache Kafka release is 0.10.0.0. A major limitation since 0.8.x is that Kafka is not backward-compatible, so we cannot replace this version with one prior to 0.8. Once you've downloaded the latest available release, let's proceed with the installation.

Installing Java

We need Java 1.7 or later. Download and install the latest JDK from Oracle's website: http://www.oracle.com/technetwork/java/javase/downloads/index.html .

To install in Linux (as an example):

  1. Change the file mode:
    [master@localhost opt]# chmod +x jdk-8u91-linux-x64.rpm 
    
  2. Go to the directory in which you want to perform the installation:
    [master@localhost opt]# cd <directory path name>
    
  3. Run the rpm installer with the command:
    [master@localhost java]# rpm -ivh jdk-8u91-linux-x64.rpm 
    
  4. Finally, add the environment variable...

Cluster


Now we are ready to program with the Apache Kafka publisher-subscriber messaging system. First, some terminology.

In Kafka, there are three types of clusters:

  • Single node - single broker
  • Single node - multiple broker
  • Multiple node - multiple broker

A Kafka cluster has five main actors:

  • Broker: The server. A Kafka cluster has one or more physical servers, and each one may have one or more server processes running. Each server process is called a broker. Topics live in the broker processes.
  • Topic: The queue. A topic is a category or feed name to which messages are published by the message producers. Topics are partitioned, and each partition is represented by an ordered, immutable sequence of messages. The cluster keeps a partitioned log for each topic. Each message in a partition has a unique sequential ID called an offset.
  • Producer: Producers publish data to topics by choosing the appropriate partition within the topic. To achieve load balancing, the allocation of messages to topic partitions can be done...
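The terminology above can be made concrete with a small model. The following is an illustrative sketch only, not Kafka's actual implementation: a topic holds numbered partitions, each partition is an append-only log, and a message's offset is simply its position in that log.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only: a topic is a set of partitions; each partition
// is an ordered, immutable sequence of messages identified by offset.
class MiniTopic {
    private final Map<Integer, List<String>> partitions = new HashMap<>();

    MiniTopic(int numPartitions) {
        for (int p = 0; p < numPartitions; p++) {
            partitions.put(p, new ArrayList<>());
        }
    }

    // Appending returns the message's offset: its index in the partition log.
    long append(int partition, String message) {
        List<String> log = partitions.get(partition);
        log.add(message);
        return log.size() - 1;
    }

    String read(int partition, long offset) {
        return partitions.get(partition).get((int) offset);
    }
}

public class TopicDemo {
    public static void main(String[] args) {
        MiniTopic topic = new MiniTopic(3);
        long o0 = topic.append(0, "first");
        long o1 = topic.append(0, "second");
        System.out.println(o0 + " " + o1);    // offsets are sequential per partition
        System.out.println(topic.read(0, 1)); // prints "second"
    }
}
```

Note that offsets are per partition, not per topic: two partitions of the same topic each start their own sequence at zero.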

Architecture


Kafka was created at LinkedIn. To start with, LinkedIn used the Java Message Service (JMS), but when they needed more power, that is, a scalable architecture, the LinkedIn development team decided to build the project that today we know as Kafka. In 2011, Kafka was open sourced as an Apache project. Due to size constraints, in this section we'll leave the reader with some reflections on why the architecture is designed the way it is.

Figure 5-6, A topic with 3 partitions, shows a topic with three partitions. In it we can see the five Kafka components: Zookeeper, broker, topic, producer, and consumer.

Figure 5-6. A topic with 3 partitions

The Kafka project goals are:

  • An API: To support the custom implementation of producers and consumers
  • Low overhead: Low network latency and low storage overhead with message persistence on disk
  • High throughput: Publishing and subscribing of millions of messages, supporting data feeds in real time
  • Distributed: Highly scalable architecture to handle low...

Producers


Producers are applications that create messages and publish them to the broker. Typical producers are frontend applications, web pages, web services, backend services, proxies, and adapters to legacy systems. We can write Kafka producers in Java, Scala, C, and Python.

The process begins when the producer connects to any live node and requests metadata about the partition leaders of a topic, so that it can send each message directly to the lead broker of its partition.
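The routing step above can be sketched as follows. This is an illustrative simplification: Kafka's default partitioner uses murmur2 hashing of the key bytes, while here we use plain `hashCode()` only to show the idea that the same key always lands on the same partition.

```java
// Illustrative sketch of key-based partition selection (not Kafka's
// actual partitioner, which uses murmur2 hashing of the key bytes).
public class PartitionDemo {
    static int partitionFor(String key, int numPartitions) {
        // Mask the sign bit so the result is a valid partition index.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int p1 = partitionFor("user-42", 3);
        int p2 = partitionFor("user-42", 3);
        System.out.println(p1 == p2); // prints "true": one key -> one partition
    }
}
```

This stickiness is what gives Kafka per-key ordering: all messages with the same key go to the same partition, and each partition is an ordered log.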

Producer API

First, we need to understand the classes needed to write a producer:

  • Producer: The producer class is KafkaProducer<K, V> in org.apache.kafka.clients.producer.KafkaProducer

KafkaProducer is a generic class: K specifies the type of the partition key and V specifies the type of the message value.

  • ProducerRecord: The class is ProducerRecord <K, V> in org.apache.kafka.clients.producer.ProducerRecord

This class encapsulates the data to be sent to the broker: the destination topic name, an optional partition number, and the key and value of the message...
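To see why the two generic parameters matter, here is a self-contained sketch of the shape of such a record. This is not the real ProducerRecord class, only a hypothetical illustration of a generic <K, V> container holding a topic name, a key, and a value:

```java
// Illustration only: a simplified generic record mirroring the idea of
// ProducerRecord<K, V> (topic + partition key + message value).
class MiniRecord<K, V> {
    final String topic;
    final K key;
    final V value;

    MiniRecord(String topic, K key, V value) {
        this.topic = topic;
        this.key = key;
        this.value = value;
    }
}

public class RecordDemo {
    public static void main(String[] args) {
        // K = String (routing key), V = String (payload); "page-views"
        // and "user-42" are made-up example names.
        MiniRecord<String, String> rec =
            new MiniRecord<>("page-views", "user-42", "home.html");
        System.out.println(rec.topic + " " + rec.key + " " + rec.value);
    }
}
```

In practice K and V are whatever types your configured key and value serializers can turn into bytes.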

Consumers


Consumers are applications that consume the messages published by the broker. Typical consumers are real-time analytics applications, near real-time analytics applications, NoSQL solutions, data warehouses, backend services, or subscriber-based solutions. We can write Kafka consumers in the JVM languages (Java, Groovy, Scala, Clojure, Ceylon), Python, and C/C++.

The consumer subscribes to a specific topic on the Kafka broker to consume its messages. The consumer then makes a fetch request to the lead broker of a partition, specifying the message offset to consume from. The consumer works in a pull model and always pulls all available messages from its current position.
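The pull model can be sketched with a toy in-memory log. This is an illustration of the offset-tracking idea, not the Kafka consumer API: the consumer remembers its own position, and each fetch returns everything from that position to the end of the log.

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration of the pull model: the consumer keeps its own offset
// and each fetch pulls all messages from that position onward.
public class PullDemo {
    public static void main(String[] args) {
        List<String> partitionLog = Arrays.asList("m0", "m1", "m2", "m3");
        int position = 1; // the consumer's current offset

        // Fetch: everything from the current position to the log's end.
        List<String> fetched = partitionLog.subList(position, partitionLog.size());
        position = partitionLog.size(); // advance past what was consumed

        System.out.println(fetched);  // prints "[m1, m2, m3]"
        System.out.println(position); // prints "4"
    }
}
```

Because the consumer, not the broker, tracks the position, a consumer can rewind to an earlier offset and replay messages.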

Consumer API

In Kafka version 0.8.0 there were two API types for consumers: the high-level API and the low-level API. In version 0.10.0 they were unified.

To use the consumer API with Maven, we should use these coordinates (the version shown matches the 0.10.0.0 release discussed above):

<dependency> 
  <groupId>org.apache.kafka</groupId> 
  <artifactId>kafka-clients</artifactId> 
  <version>0.10.0.0</version> 
</dependency>

Integration


Processing small amounts of data in real time is not a challenge when we use the Java Message Service (JMS), but, if we learn from the LinkedIn experience, we will see that these processing systems have serious performance limitations when dealing with large data volumes. Moreover, these systems are a nightmare when we try to scale horizontally, because they simply were not designed to.

Integration with Apache Spark

For this demo, we need a Kafka cluster up and running. Also, we need Spark installed on our machine and ready to be deployed.

Apache Spark has a utility class to create a data stream to be read from Kafka. As with any Spark project, we first need to create the SparkConf and the Spark streaming context:

val sparkConf = new SparkConf().setAppName("SparkKafkaTest") 
val jssc = new JavaStreamingContext(sparkConf, Durations.seconds(10)) 

The JavaStreamingContext is a Java-friendly version of StreamingContext, which is the main entry point for Spark Streaming functionality.
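The batch duration passed to the streaming context (Durations.seconds(10) above) controls micro-batching: Spark Streaming groups incoming records into fixed-length time windows and processes each window as a small batch. The grouping idea can be illustrated with a toy, self-contained sketch (this is not Spark code; the timestamps are made-up example values):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy illustration of micro-batching: events carry timestamps (ms) and
// are grouped into fixed-duration batches, as Spark Streaming does.
public class BatchDemo {
    public static void main(String[] args) {
        long batchMillis = 10_000; // mirrors Durations.seconds(10)
        long[] eventTimes = {1000, 4000, 12000, 15000, 21000};

        // Batch index = timestamp / batch duration.
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long t : eventTimes) {
            batches.computeIfAbsent(t / batchMillis, k -> new ArrayList<>()).add(t);
        }
        System.out.println(batches); // prints "{0=[1000, 4000], 1=[12000, 15000], 2=[21000]}"
    }
}
```

A shorter batch duration lowers latency but raises scheduling overhead; ten seconds is simply the value used in this demo.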

We create the Hashset...

Administration


Kafka provides numerous tools to manage features such as cluster management, topic administration, and cluster mirroring. Let's see some of these tools.

Cluster tools

When replicating multiple partitions, we obtain several replicas, of which one acts as the leader and the rest as followers. When the leader is unavailable, a follower takes over the leadership.

When we have to shut down a broker for maintenance activities, new leaders are elected sequentially. This brings significant I/O load on Zookeeper; with a big cluster, this means delays in service.

To achieve high availability, Kafka provides a tool for shutting down brokers gracefully. This tool transfers the leadership among the replicas or to another broker before the shutdown. If no in-sync replica is available, the tool refuses to shut down the broker, to ensure data integrity.

This tool is used with this command:

[master@localhost kafka_2.10-0.10.0.0]# bin/kafka-run-class.sh kafka.admin.ShutdownBroker --zookeeper <zookeeper_host:port...
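Besides the command-line tool, leadership migration on shutdown can also be requested through broker configuration. The sketch below shows the relevant property in the broker's server.properties; its default value varies between Kafka versions, so check the documentation for the release you run:

```
# server.properties: ask the broker to migrate partition leadership
# away from itself before completing a shutdown
controlled.shutdown.enable=true
```

With this enabled, a normal broker stop attempts the same leadership handoff that the tool performs explicitly.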

Summary


This was a complete review of Apache Kafka, and we have touched upon many important facts about it. We learned why Kafka was developed, how to install it, and its support for different types of clusters. We also explored Kafka's design approach and wrote a few basic producers and consumers.

In this chapter, we learned how to set up a Kafka cluster with single and multiple brokers on a single node, run producers and consumers from the command line, and exchange some messages. We also discussed important settings about the broker. Finally, we discussed Kafka's integration with technologies such as Spark.

In the next chapter, we will review some integration patterns with examples. Also see Chapter 7, Study Case 1 - Spark and Cassandra, for an in-depth example of Kafka integration with the other technologies.
