Chapter 7. Messaging Layer using Apache Kafka

Handling streamed data is a very important aspect of a Data Lake. In the Data Lake architecture discussed in this book, the handling of streamed data is the responsibility of the messaging layer. In this chapter, we will go into detail on this layer and also discuss the technology that we have chosen to do the actual work in it.

We have chosen Apache Kafka as the fitting technology to be used in the messaging layer. This chapter delves deep into this technology and its architecture with regard to the Data Lake.

Context in Data Lake - messaging layer


In this chapter, we are dealing with a technology that constitutes one of the core layers of a Data Lake, namely the messaging layer. It's crucial to have a fully functional messaging layer to deal with the real-time data streams flowing in from different applications in an enterprise.

The technology that we have shortlisted to do this very important job of handling such stream data is Apache Kafka. This chapter will take you through the functioning of the messaging layer and then deep dive into the technology, Kafka.

Messaging layer

In Chapter 2, Comprehensive Concepts of a Data Lake, you already got a high-level view of the messaging layer and how it works, especially in the context of a Data Lake.

Figure 01: Data Lake: Messaging layer

The messaging layer in a Data Lake takes care of a set of functions/capabilities, as mentioned in the following bulleted list:

  • One of the core capabilities of this layer is its ability to decouple both the source (producer) and destination...

Why Apache Kafka


We are using Apache Kafka as the stream data platform (MOM: message-oriented middleware). The core reasons for choosing Kafka are its high reliability and its ability to deal with data with very low latency.

Note

Message-oriented middleware (MOM) is software or hardware infrastructure supporting the sending and receiving of messages between distributed systems.

- Wikipedia

Apache Kafka has some key attributes that make it an ideal choice for us in achieving the capabilities that we are looking to implement in the Data Lake. They are bulleted below:

  • Scalability: Capable of handling high-velocity and high-volume data, with throughput of hundreds of megabytes per second over terabytes of data.
  • Distributed: Kafka is distributed by design and handles some of the distributed capabilities as follows:
    • Replication: Replication is one of the default features that needs to be available in any distributed technology, and Kafka has it built in.
    • Partition capable: Again...

Kafka architecture


This section aims to explain the ins and outs of Apache Kafka. We will dive deep into its architecture and then, later on, expand on each of its architectural components in a bit more detail.

So, let's stream forward.

Core architecture principles of Kafka

The main motivation behind Kafka, when it was developed by LinkedIn's engineering team, was:

To create a unified messaging platform to cater to real-time data from various applications in a big organization.

- LinkedIn

 

There are core architecture principles based on which Kafka was conceived and designed. The bulleted points sum up these principles:

  • Maximize performance (compression and B-tree usage are examples)
  • Wherever possible, offload work to core kernel capabilities to drive optimization and performance (zero-copy and direct use of the Linux filesystem cache are examples)
  • Distributed architecture
  • Fault tolerance
  • Durability of messages
  • Wherever possible, eliminate redundant work
  • Offload responsibility of tasks to consuming...

Other Kafka components


In addition to these, there are some important components in a Kafka deployment without which Kafka won't work as intended. A couple of these important components are:

  • Zookeeper
  • MirrorMaker

Zookeeper

Zookeeper is one of the very important hidden components, needed (mandatory) for Kafka to function properly. It is entrusted with the following jobs (a minimal broker-to-Zookeeper configuration is sketched after this list):

  • Taking care of bringing each broker into the cluster membership.
  • Electing the Kafka controller, which performs some very important functions within the cluster, such as managing the state of partitions and their replicas.
  • Maintaining complete topic configurations, such as the number of partitions, leader partition election, partition replica locations, and so on.
  • Maintaining access control lists and various quotas within each broker.
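As a minimal sketch of this relationship, each broker's server.properties simply points at the Zookeeper ensemble, and the broker registers itself there on startup; the host names and ports below are illustrative:

# ${KAFKA_HOME}/config/server.properties (excerpt)
broker.id=0
# Comma-separated Zookeeper ensemble that this broker joins on startup
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
zookeeper.connection.timeout.ms=6000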

MirrorMaker

As the name suggests, it helps mirror data across Kafka clusters. This component can be used to mirror an entire cluster from one data center to another, as shown in the next figure...
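As a hedged sketch, MirrorMaker ships with the Kafka distribution and is driven by two property files, one describing the source cluster it consumes from and one describing the target cluster it produces to; the property file names below are illustrative:

${KAFKA_HOME}/bin/kafka-mirror-maker.sh \
    --consumer.config source-cluster.properties \
    --producer.config target-cluster.properties \
    --whitelist ".*"

The --whitelist option takes a regular expression of topic names, so a subset of topics can be mirrored instead of the whole cluster.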

Kafka programming interface


Kafka contains two programming interface mechanisms:

  • Low-level core APIs
  • REST APIs: a REST interface wrapping the core APIs for easy access

Kafka core APIs

These are the core APIs in Apache Kafka, as documented in the Apache Kafka documentation (a minimal producer sketch follows the list):

  • Producer API: Contains a set of APIs that allow us to publish a stream of data to one or more of the named/categorized Kafka topics in the cluster.
  • Streams API: Contains relevant APIs that act on streams of data. They can process this stream data and transform it from its existing form into a designated form, according to what your use case demands. These are relatively new APIs compared to the existing producer and consumer APIs.
  • Connect API: APIs that allow Kafka to be extensible. It contains methods that can be used to build Kafka connectors for inputting data into and outputting data from Kafka.
  • Consumer API: Contains relevant APIs to subscribe to one or more topics in the broker. Since the consumer takes care of a message...
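To make the Producer API concrete, here is a minimal sketch of a producer written against the Kafka 0.10 Java client used later in this chapter; the broker address and the topic name (my-topic) are illustrative and should be replaced with values from your own setup:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Illustrative broker address; point this at your own cluster
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Producer<String, String> producer = new KafkaProducer<>(props);
        // Publish a single record to the (illustrative) topic "my-topic"
        producer.send(new ProducerRecord<>("my-topic", "key-1", "hello from the Producer API"));
        producer.close();
    }
}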

Producer and consumer reliability


In distributed systems, components fail. It's a common practice to design your code to take care of these failures in a seamless fashion (fault tolerance).

One of the ways by which Kafka tolerates failure is by maintaining the replication of messages. Messages are replicated in so-called partitions, and Kafka automatically elects one partition as the leader while the other, follower partitions just replicate the leader. The leader also maintains a list of replicas that are in sync, so as to make sure that ideal replication is maintained to handle failures.

The producer sends messages to a topic (a Kafka broker in the Kafka cluster), and durability can be configured using the producer configuration request.required.acks, which has the following values (a small configuration sketch follows this list):

  • 0: the message is only written to the network/buffer (the producer does not wait for an acknowledgement)
  • 1: the message is written to the partition leader
  • all: the producer gets an acknowledgement once all in-sync replicas (ISRs) have the message
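request.required.acks belongs to the older Scala producer; with the newer Java producer used in this chapter's sketches, the equivalent setting is simply acks. A minimal way of configuring it for the strongest durability, extending the Properties used to build the producer in the earlier sketch, would be:

// Added to the Properties used to construct the KafkaProducer shown earlier
props.put("acks", "all");   // "0", "1", or "all" -- mirrors the acknowledgement levels listed above
props.put("retries", "3");  // retry transient send failures for extra durability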

The consumer reads data from topics, and in Kafka the state of...
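As a hedged sketch of how a consumer keeps track of this state with the 0.10 Java client, the example below subscribes to a topic, polls for records, and commits its offsets back to Kafka explicitly; the broker address, topic name, and group id are illustrative:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // illustrative broker address
        props.put("group.id", "data-lake-consumers");       // illustrative consumer group
        props.put("enable.auto.commit", "false");           // we commit offsets ourselves
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("my-topic"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(1000);
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
            consumer.commitSync(); // record the consumed offsets back in Kafka
        }
    }
}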

Kafka security


When Kafka was designed and developed at LinkedIn, security was kept out of scope to a large extent; it was an afterthought, addressed after Kafka became a top-level Apache project. Later, in 2014, various security discussions were taken up for Kafka, especially around data-at-rest security and transport layer security.

The Kafka broker allows clients to connect on multiple ports, and each port supports a different security mechanism, as detailed here:

  • No wire encryption or authentication
  • SSL: wire encryption and authentication
  • SASL: Kerberos authentication
  • SSL + SASL: SSL for wire encryption and SASL for authentication
  • Authorization similar to Unix permissions for read/write by a client

These security features are led by Confluent and more details can be found at http://docs.confluent.io/2.0.0/kafka/security.html.
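A minimal sketch of exposing several such ports on one broker, assuming the SSL keystores and truststores have already been created, could look like the following server.properties excerpt; the port numbers, file paths, and placeholder passwords are illustrative:

# ${KAFKA_HOME}/config/server.properties (security excerpt)
listeners=PLAINTEXT://:9092,SSL://:9093,SASL_SSL://:9094
security.inter.broker.protocol=SSL
ssl.keystore.location=/var/private/ssl/kafka.server.keystore.jks
ssl.keystore.password=<keystore-password>
ssl.key.password=<key-password>
ssl.truststore.location=/var/private/ssl/kafka.server.truststore.jks
ssl.truststore.password=<truststore-password>
sasl.enabled.mechanisms=GSSAPI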

Kafka as message-oriented middleware


Message-oriented middleware (MOM) is software or hardware infrastructure supporting sending and receiving messages between distributed systems. MOM allows application modules to be distributed over heterogeneous platforms and reduces the complexity of developing applications that span multiple operating systems and network protocols. The middleware creates a distributed communications layer that insulates the application developer from the details of the various operating systems and network interfaces.

- Wikipedia

Looking at the definition of MOM above, Kafka fits into the category of a MOM and does cater to all the capabilities needed of one. But Kafka is not just a simple queue/message management solution; it has certain core capabilities making it more marketable than a traditional MOM. Some of its inherent capabilities that are advantages are:

  • The approach used is log based (a distributed commit log) with zero-copy, and messages are always appended
  • Uses partitions...

Scale-out architecture with Kafka


The main principles on which Kafka works have been covered earlier in this chapter. We won't cover those again here; however, below are the main reasons for Kafka's scale-out architecture (a short topic-creation sketch follows the figure):

  • Partition: A topic is split into multiple partitions, and increasing the number of partitions is one mechanism of scaling.
  • Distribution: A cluster can have one or more brokers, and these brokers can be increased in number to achieve scaling.
  • Replication: Similar to partitions, messages are replicated multiple times for fault tolerance, and this aspect also brings scalability to Kafka.
  • Scaling: Each consumer reads messages from a single partition (of a topic); to scale out, we add more consumers, and the newly added consumers read messages from additional partitions (within a consumer group, two consumers cannot read from the same partition; this is a rule), as shown in this figure.

Figure 12: Scale out by adding more consumers
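A small sketch of how these levers are exercised in practice: partitions and the replication factor are fixed when a topic is created, and consumers are scaled by starting more instances with the same group.id. The topic name and counts here are illustrative:

${KAFKA_HOME}/bin/kafka-topics.sh --create --zookeeper localhost:2181 \
    --topic clickstream --partitions 6 --replication-factor 3

With six partitions, up to six consumers sharing one group.id can read in parallel; a seventh consumer in the same group would sit idle until a partition becomes free.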

Kafka Connect


Extensibility is one of the important design principles followed rigorously by Kafka. The Kafka Connect tool makes Kafka extensible. The tool enables Kafka to connect with external systems, helping bring data into it and also take data out of it to other systems. It has a common framework, using which custom connectors can be written. More details on Kafka Connect can be found in the Kafka documentation at https://kafka.apache.org/documentation.html#connect.

The following figure shows how Kafka Connect works.

Figure 13: Kafka connect working

Kafka connectors are categorized into two types (a file source example follows this list):

  • Source Connectors: Connectors that bring data into Kafka topics.
  • Sink Connectors: Connectors that take data out of Kafka topics into other external systems.
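As an illustration, the Kafka distribution ships with a simple file source connector that can be run in standalone mode; the connector property file name, input file, and topic below are illustrative:

${KAFKA_HOME}/bin/connect-standalone.sh \
    ${KAFKA_HOME}/config/connect-standalone.properties \
    my-file-source.properties

where my-file-source.properties (an illustrative name) contains:

name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/input.txt
topic=connect-test

Each line appended to /tmp/input.txt would then be published as a message on the connect-test topic.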

There is a huge list of connectors available, catering to various external systems, using which Kafka can hook onto them. These existing connectors are again broadly categorized into two:

  • Certified connectors: Connectors which are written using the Kafka...

Kafka working example


We have briefly discussed a basic setup of Kafka as part of the Flume examples. The basic setup of Kafka listed there remains the same, hence the installation steps remain the same; however, we will additionally look at usage examples of Kafka as a message broker.

The most natural programming languages for Kafka are currently Scala and Java. Hence, to keep things simple, we will be using Java as our language of choice for the examples.

Installation

  1. Download the Kafka binaries from the following link, using this command:
wget http://redrockdigimark.com/apachemirror/kafka/0.10.1.1/kafka_2.11-0.10.1.1.tgz
  2. Change to a user directory where we want to extract the contents of the Kafka tarball, using the following command. Let us refer to the extracted Kafka directory as ${KAFKA_HOME}:
tar -xzvf ${DOWNLOAD_DIRECTORY}/kafka_2.11-0.10.1.1.tgz
  3. Set KAFKA_HOME as an environment variable using the following command and add the same to ~/.bashrc (a quick smoke test of the setup is sketched after these steps):
export KAFKA_HOME=<PATH...
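Once KAFKA_HOME is set, a minimal smoke test of the installation (assuming the default configuration files shipped with the distribution) can be run from separate terminals as follows; the topic name test is illustrative:

${KAFKA_HOME}/bin/zookeeper-server-start.sh ${KAFKA_HOME}/config/zookeeper.properties
${KAFKA_HOME}/bin/kafka-server-start.sh ${KAFKA_HOME}/config/server.properties
${KAFKA_HOME}/bin/kafka-topics.sh --create --zookeeper localhost:2181 \
    --replication-factor 1 --partitions 1 --topic test
${KAFKA_HOME}/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
${KAFKA_HOME}/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning

Anything typed into the console producer should appear in the console consumer, confirming the broker is working end to end.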

When to use Kafka


Kafka has core capabilities that make it the choice for our use case, and these were documented at the start of this chapter. Kafka should be used:

  • When you need a highly distributed messaging system
  • When you need a messaging system which can scale out exponentially
  • When you need high throughput on publishing and subscribing
  • When you have varied consumers with varied capabilities subscribing to the published messages in the topics
  • When you need fault-tolerant operation
  • When you need durability in message delivery
  • Obviously, all of the preceding without tolerating any performance degradation; with all of the above, it should be blazing fast in operation

When not to use Kafka


For certain scenarios and use cases, you shouldn't use Kafka:

  • If you need to have your messages processed in strict order, you need to have one consumer and one partition. But this is not at all the way Kafka is normally used; we have multiple consumers and multiple partitions (by design, one consumer consumes from one partition), and because of this, it won't serve such a use case.
  • If you need to implement a task queue, for the same reason as in the preceding point. It doesn't have all the bells and whistles that you associate with a typical queue service.
  • If you need typical topic capability (first in, first out), as the way Kafka functions is quite different.
  • If your development and production environments are Windows or Node.js based (a subjective point, but it's good to know that this aspect is quite true).
  • If you need high security with finer controls. Kafka was not originally designed with security in mind, and this plagues Kafka at times...

Other options


There are sections in this chapter that detail the advantages of using Kafka, as well as sections that detail its disadvantages and when not to use it. That means that Kafka, for us, is simply the choice best suited to the topic that we are covering in this book and to the SCV use case. The main reason for this choice is Kafka's clear advantages, especially when dealing with big data and its associated technologies.

There are other options in the market that are full-fledged messaging systems (MOM) and possess rich features compared to Kafka. Some of the alternatives that we think you could look into as replacements for Kafka are briefly summarized in this section. In no way do we mean to say that these cannot be used for our use case, just that we think Kafka is the best fit. If we were to look at other options in place of Kafka, these alternatives are our favorites.

All the technology choices have been made after careful technical analysis and with our book we want to...

Summary


The topic covered in this chapter is quite exhaustive. Rest assured that you have covered enough of Apache Kafka to implement a Data Lake.

In this chapter, we started with the relevance of the messaging layer in the context of a Data Lake. After that, the chapter deep dived into Kafka and detailed its architecture and its various components. It then showed you a full working example of Kafka with step-by-step instructions, from installation all the way to taking data from a source to a destination using Kafka. Finally, as in other chapters, we introduced other choices that can replace Kafka as the technology to achieve the same capability.

After reading this chapter, you should now have a clear idea of the messaging layer and a deep understanding of Apache Kafka and how it works. You should also have a clear idea of how our use case can use this technology and what exactly it accomplishes.
