Chapter 7. Messaging Layer using Apache Kafka

Handling streamed data is a very important aspect of a Data Lake. In the Data Lake architecture discussed in this book, the handling of streamed data is the responsibility of the messaging layer. In this chapter, we will go into detail on this layer and also discuss the technology that we have chosen to do the actual work in it.

We have chosen Apache Kafka as the fitting technology to be used in the messaging layer. This chapter delves deep into this technology and its architecture with regard to the Data Lake.

Context in Data Lake - messaging layer


In this chapter, we are dealing with a technology that constitutes one of the core layers of a Data Lake, namely the messaging layer. It's crucial to have a fully functional messaging layer to deal with the real-time data streams flowing in from different applications in an enterprise.

The technology that we have shortlisted to do this very important job of handling such stream data is Apache Kafka. This chapter will take you through the functioning of the messaging layer and then deep dive into the technology, Kafka.

Messaging layer

In Chapter 2, Comprehensive Concepts of a Data Lake, you already got a high-level view of the messaging layer and how it works, especially in the context of a Data Lake.

Figure 01: Data Lake: Messaging layer

The messaging layer in a Data Lake takes care of a set of functions/capabilities, as mentioned in the following bulleted list:

  • One of the core capabilities of this layer is its ability to decouple both the source (producer) and destination...

Why Apache Kafka


We are using Apache Kafka as the stream data platform (MOM: message-oriented middleware). The core reasons for choosing Kafka are its high reliability and its ability to deal with data with very low latency.

Note

Message-oriented middleware (MOM) is software or hardware infrastructure supporting the sending and receiving of messages between distributed systems.

- Wikipedia

Apache Kafka has some key attributes that make it an ideal choice for us in achieving the capabilities that we are looking to implement in the Data Lake. They are bulleted below:

  • Scalability: Capable of handling high-velocity and high-volume data, with throughput of hundreds of megabytes per second over terabytes of data.
  • Distributed: Kafka is distributed by design and handles some of the distributed capabilities as follows:
    • Replication: Replication is one of the default features that needs to be available in any distributed technology, and Kafka has it built in.
    • Partition capable: Again...

Kafka architecture


This section aims to explain the ins and outs of Apache Kafka. We will dive deep into its architecture and then, later on, expand on each of its architectural components in a bit more detail.

So, let's stream forward.

Core architecture principles of Kafka

The main motivation behind Kafka, when it was developed by LinkedIn's engineering team, was:

To create a unified messaging platform to cater to real-time data from various applications in a big organization.

- LinkedIn

 

There are core architecture principles based on which Kafka was conceived and designed. The bulleted points sum up these principles:

  • Maximize performance (compression and B-tree usage are examples)
  • Wherever possible, offload work to core kernel capabilities to drive optimization and performance (zero-copy and direct use of the Linux filesystem cache are examples)
  • Distributed architecture
  • Fault tolerance
  • Durability of messages
  • Wherever possible, eliminate redundant work
  • Offload responsibility of tasks to consuming...

Other Kafka components


In addition to these, there are some important components in a Kafka deployment without which Kafka won't work as intended. A couple of these important components are:

  • Zookeeper
  • MirrorMaker

Zookeeper

Zookeeper is one of the very important hidden components, needed (mandatory) for Kafka to function properly. It is entrusted with the following jobs (a minimal broker-to-Zookeeper configuration is sketched after this list):

  • Taking care of bringing each broker into the cluster membership.
  • Electing the Kafka controller, which performs some very important functions within the cluster, such as managing the state of partitions and their replicas.
  • Maintaining complete topic configurations, such as the number of partitions, leader partition election, partition replica locations, and so on.
  • Maintaining access control lists and various quotas within each broker.
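As a minimal sketch of this relationship, each broker's server.properties simply points at the Zookeeper ensemble, and the broker registers itself there on startup; the host names and ports below are illustrative:

# ${KAFKA_HOME}/config/server.properties (excerpt)
broker.id=0
# Comma-separated Zookeeper ensemble that this broker joins on startup
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
zookeeper.connection.timeout.ms=6000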

MirrorMaker

As the name suggests, it helps mirror data across Kafka clusters. This component can be used to mirror an entire cluster from one data center to another, as shown in the next figure...
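As a hedged sketch, MirrorMaker ships with the Kafka distribution and is driven by two property files, one describing the source cluster it consumes from and one describing the target cluster it produces to; the property file names below are illustrative:

${KAFKA_HOME}/bin/kafka-mirror-maker.sh \
    --consumer.config source-cluster.properties \
    --producer.config target-cluster.properties \
    --whitelist ".*"

The --whitelist option takes a regular expression of topic names, so a subset of topics can be mirrored instead of the whole cluster.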

Kafka programming interface


Kafka contains two programming interface mechanisms:

  • Low-level core APIs
  • REST APIs: a REST interface wrapping the core APIs for easy access

Kafka core APIs

These are the core APIs in Apache Kafka, as documented in the Apache Kafka documentation (a minimal producer sketch follows the list):

  • Producer API: Contains a set of APIs that allow us to publish a stream of data to one or more of the named/categorized Kafka topics in the cluster.
  • Streams API: Contains relevant APIs that act on streams of data. They can process this stream data and transform it from its existing form into a designated form, according to what your use case demands. These are relatively new APIs compared to the existing producer and consumer APIs.
  • Connect API: APIs that allow Kafka to be extensible. It contains methods that can be used to build Kafka connectors for inputting data into and outputting data from Kafka.
  • Consumer API: Contains relevant APIs to subscribe to one or more topics in the broker. Since the consumer takes care of a message...
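To make the Producer API concrete, here is a minimal sketch of a producer written against the Kafka 0.10 Java client used later in this chapter; the broker address and the topic name (my-topic) are illustrative and should be replaced with values from your own setup:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Illustrative broker address; point this at your own cluster
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Producer<String, String> producer = new KafkaProducer<>(props);
        // Publish a single record to the (illustrative) topic "my-topic"
        producer.send(new ProducerRecord<>("my-topic", "key-1", "hello from the Producer API"));
        producer.close();
    }
}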

Producer and consumer reliability


In distributed systems, components fail. It's a common practice to design your code to take care of these failures in a seamless fashion (fault tolerance).

One of the ways by which Kafka tolerates failure is by maintaining the replication of messages. Messages are replicated in so-called partitions, and Kafka automatically elects one partition as the leader while the other, follower partitions just replicate the leader. The leader also maintains a list of replicas that are in sync, so as to make sure that ideal replication is maintained to handle failures.

The producer sends messages to a topic (a Kafka broker in the Kafka cluster), and durability can be configured using the producer configuration request.required.acks, which has the following values (a small configuration sketch follows this list):

  • 0: the message is only written to the network/buffer (the producer does not wait for an acknowledgement)
  • 1: the message is written to the partition leader
  • all: the producer gets an acknowledgement once all in-sync replicas (ISRs) have the message
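request.required.acks belongs to the older Scala producer; with the newer Java producer used in this chapter's sketches, the equivalent setting is simply acks. A minimal way of configuring it for the strongest durability, extending the Properties used to build the producer in the earlier sketch, would be:

// Added to the Properties used to construct the KafkaProducer shown earlier
props.put("acks", "all");   // "0", "1", or "all" -- mirrors the acknowledgement levels listed above
props.put("retries", "3");  // retry transient send failures for extra durability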

The consumer reads data from topics, and in Kafka the state of...
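As a hedged sketch of how a consumer keeps track of this state with the 0.10 Java client, the example below subscribes to a topic, polls for records, and commits its offsets back to Kafka explicitly; the broker address, topic name, and group id are illustrative:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // illustrative broker address
        props.put("group.id", "data-lake-consumers");       // illustrative consumer group
        props.put("enable.auto.commit", "false");           // we commit offsets ourselves
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("my-topic"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(1000);
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
            consumer.commitSync(); // record the consumed offsets back in Kafka
        }
    }
}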

Kafka security


When Kafka was designed and developed at LinkedIn, security was kept out of scope to a large extent; it was an afterthought, addressed after Kafka became a top-level Apache project. Later, in 2014, various security discussions were taken up for Kafka, especially around data-at-rest security and transport layer security.

The Kafka broker allows clients to connect on multiple ports, and each port supports a different security mechanism, as detailed here:

  • No wire encryption or authentication
  • SSL: wire encryption and authentication
  • SASL: Kerberos authentication
  • SSL + SASL: SSL for wire encryption and SASL for authentication
  • Authorization similar to Unix permissions for read/write by a client

These security features are led by Confluent and more details can be found at http://docs.confluent.io/2.0.0/kafka/security.html.
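A minimal sketch of exposing several such ports on one broker, assuming the SSL keystores and truststores have already been created, could look like the following server.properties excerpt; the port numbers, file paths, and placeholder passwords are illustrative:

# ${KAFKA_HOME}/config/server.properties (security excerpt)
listeners=PLAINTEXT://:9092,SSL://:9093,SASL_SSL://:9094
security.inter.broker.protocol=SSL
ssl.keystore.location=/var/private/ssl/kafka.server.keystore.jks
ssl.keystore.password=<keystore-password>
ssl.key.password=<key-password>
ssl.truststore.location=/var/private/ssl/kafka.server.truststore.jks
ssl.truststore.password=<truststore-password>
sasl.enabled.mechanisms=GSSAPI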

Kafka as message-oriented middleware


Message-oriented middleware (MOM) is software or hardware infrastructure supporting sending and receiving messages between distributed systems. MOM allows application modules to be distributed over heterogeneous platforms and reduces the complexity of developing applications that span multiple operating systems and network protocols. The middleware creates a distributed communications layer that insulates the application developer from the details of the various operating systems and network interfaces.

- Wikipedia

Looking at the definition of MOM above, Kafka fits into the category of a MOM and does cater to all the capabilities needed of one. But Kafka is not just a simple queue/message management solution; it has certain core capabilities making it more marketable than a traditional MOM. Some of its inherent capabilities that are advantages are:

  • The approach used is log based (a distributed commit log) with zero-copy, and messages are always appended
  • Uses partitions...

Scale-out architecture with Kafka


The main principles on which Kafka works have been covered earlier in this chapter. We won't cover those again here; however, below are the main reasons for Kafka's scale-out architecture (a short topic-creation sketch follows the figure):

  • Partition: A topic is split into multiple partitions, and increasing the number of partitions is one mechanism of scaling.
  • Distribution: A cluster can have one or more brokers, and these brokers can be increased in number to achieve scaling.
  • Replication: Similar to partitions, messages are replicated multiple times for fault tolerance, and this aspect also brings scalability to Kafka.
  • Scaling: Each consumer reads messages from a single partition (of a topic); to scale out, we add more consumers, and the newly added consumers read messages from additional partitions (within a consumer group, two consumers cannot read from the same partition; this is a rule), as shown in this figure.

Figure 12: Scale out by adding more consumers
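A small sketch of how these levers are exercised in practice: partitions and the replication factor are fixed when a topic is created, and consumers are scaled by starting more instances with the same group.id. The topic name and counts here are illustrative:

${KAFKA_HOME}/bin/kafka-topics.sh --create --zookeeper localhost:2181 \
    --topic clickstream --partitions 6 --replication-factor 3

With six partitions, up to six consumers sharing one group.id can read in parallel; a seventh consumer in the same group would sit idle until a partition becomes free.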

Kafka Connect


Extensibility is one of the important design principles followed rigorously by Kafka. The Kafka Connect tool makes Kafka extensible. The tool enables Kafka to connect with external systems, helping bring data into it and also take data out of it to other systems. It has a common framework, using which custom connectors can be written. More details on Kafka Connect can be found in the Kafka documentation at https://kafka.apache.org/documentation.html#connect.

The following figure shows how Kafka Connect works.

Figure 13: Kafka connect working

Kafka connectors are categorized into two types (a file source example follows this list):

  • Source Connectors: Connectors that bring data into Kafka topics.
  • Sink Connectors: Connectors that take data out of Kafka topics into other external systems.
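As an illustration, the Kafka distribution ships with a simple file source connector that can be run in standalone mode; the connector property file name, input file, and topic below are illustrative:

${KAFKA_HOME}/bin/connect-standalone.sh \
    ${KAFKA_HOME}/config/connect-standalone.properties \
    my-file-source.properties

where my-file-source.properties (an illustrative name) contains:

name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/input.txt
topic=connect-test

Each line appended to /tmp/input.txt would then be published as a message on the connect-test topic.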

There is a huge list of connectors available, catering to various external systems, using which Kafka can hook onto them. These existing connectors are again broadly categorized into two:

  • Certified connectors: Connectors which are written using the Kafka...

Kafka working example


We have briefly discussed a basic setup of Kafka as part of the Flume examples. The basic setup of Kafka listed there remains the same, hence the installation steps remain the same; however, we will additionally look at usage examples of Kafka as a message broker.

The most natural programming languages for Kafka are currently Scala and Java. Hence, to keep things simple, we will be using Java as our language of choice for the examples.

Installation

  1. Download the Kafka binaries from the following link, using this command:
wget http://redrockdigimark.com/apachemirror/kafka/0.10.1.1/kafka_2.11-0.10.1.1.tgz
  2. Change to a user directory where we want to extract the contents of the Kafka tarball, using the following command. Let us refer to the extracted Kafka directory as ${KAFKA_HOME}:
tar -xzvf ${DOWNLOAD_DIRECTORY}/kafka_2.11-0.10.1.1.tgz
  3. Set KAFKA_HOME as an environment variable using the following command and add the same to ~/.bashrc (a quick smoke test of the setup is sketched after these steps):
export KAFKA_HOME=<PATH...
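Once KAFKA_HOME is set, a minimal smoke test of the installation (assuming the default configuration files shipped with the distribution) can be run from separate terminals as follows; the topic name test is illustrative:

${KAFKA_HOME}/bin/zookeeper-server-start.sh ${KAFKA_HOME}/config/zookeeper.properties
${KAFKA_HOME}/bin/kafka-server-start.sh ${KAFKA_HOME}/config/server.properties
${KAFKA_HOME}/bin/kafka-topics.sh --create --zookeeper localhost:2181 \
    --replication-factor 1 --partitions 1 --topic test
${KAFKA_HOME}/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
${KAFKA_HOME}/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning

Anything typed into the console producer should appear in the console consumer, confirming the broker is working end to end.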

When to use Kafka


Kafka has core capabilities that make it the choice for our use case, and these were documented at the start of this chapter. Kafka should be used:

  • When you need a highly distributed messaging system
  • When you need a messaging system which can scale out exponentially
  • When you need high throughput on publishing and subscribing
  • When you have varied consumers with varied capabilities subscribing to the published messages in the topics
  • When you need fault-tolerant operation
  • When you need durability in message delivery
  • Obviously, all of the preceding without tolerating any performance degradation; with all of the above, it should be blazing fast in operation

When not to use Kafka


For certain scenarios and use cases, you shouldn't use Kafka:

  • If you need to have your messages processed in strict order, you need to have one consumer and one partition. But this is not at all the way Kafka is normally used; we have multiple consumers and multiple partitions (by design, one consumer consumes from one partition), and because of this, it won't serve such a use case.
  • If you need to implement a task queue, for the same reason as in the preceding point. It doesn't have all the bells and whistles that you associate with a typical queue service.
  • If you need typical topic capability (first in, first out), as the way Kafka functions is quite different.
  • If your development and production environments are Windows or Node.js based (a subjective point, but it's good to know that this aspect is quite true).
  • If you need high security with finer controls. Kafka was not originally designed with security in mind, and this plagues Kafka at times...

Other options


There are sections in this chapter that detail the advantages of using Kafka, as well as sections that detail its disadvantages and when not to use it. That means that Kafka, for us, is simply the choice best suited to the topic that we are covering in this book and to the SCV use case. The main reason for this choice is Kafka's clear advantages, especially when dealing with big data and its associated technologies.

There are other options in the market that are full-fledged messaging systems (MOM) and possess rich features compared to Kafka. Some of the alternatives that we think you could look into as replacements for Kafka are briefly summarized in this section. In no way do we mean to say that these cannot be used for our use case, just that we think Kafka is the best fit. If we were to look at other options in place of Kafka, these alternatives are our favorites.

All the technology choices have been made after careful technical analysis and with our book we want to...

Summary


The topic covered in this chapter is quite exhaustive. Rest assured that you have covered enough of Apache Kafka to implement a Data Lake.

In this chapter, we started with the relevance of the messaging layer in the context of a Data Lake. After that, the chapter deep dived into Kafka and detailed its architecture and its various components. It then showed you a full working example of Kafka with step-by-step instructions, from installation all the way to taking data from a source to a destination using Kafka. Finally, as in other chapters, we introduced other choices that can replace Kafka as the technology to achieve the same capability.

After reading this chapter, you should now have a clear idea of the messaging layer and a deep understanding of Apache Kafka and how it works. You should also have a clear idea of how our use case can use this technology and what exactly it accomplishes.
