About this book

Message publishing is a mechanism of connecting heterogeneous applications together with messages that are routed between them, for example by using a message broker like Apache Kafka. Such solutions deal with real-time volumes of information and route it to multiple consumers without letting information producers know who the final consumers are.

Apache Kafka is a practical, hands-on guide providing you with a series of step-by-step practical implementations, which will help you take advantage of the real power behind Kafka, and give you a strong grounding for using it in your publisher-subscriber based architectures.

Apache Kafka takes you through a number of clear, practical implementations that will help you to take advantage of the power of Apache Kafka, quickly and painlessly. You will learn everything you need to know for setting up Kafka clusters. This book explains how Kafka basic blocks like producers, brokers, and consumers actually work and fit together. You will then explore additional settings and configuration changes to achieve ever more complex goals. Finally you will learn how Kafka works with other tools like Hadoop, Storm, and so on.

You will learn everything you need to know to work with Apache Kafka in the right format, as well as how to leverage its power of handling hundreds of megabytes of messages per second from multiple clients.

Publication date:
October 2013


Chapter 1. Introducing Kafka

Welcome to the world of Apache Kafka.

In today's world, real-time information is continuously getting generated by applications (business, social, or any other type), and this information needs easy ways to be reliably and quickly routed to multiple types of receivers. Most of the time, applications that are producing information and applications that are consuming this information are well apart and inaccessible to each other. This, at times, leads to redevelopment of information producers or consumers to provide an integration point between them. Therefore, a mechanism is required for seamless integration of information of producers and consumers to avoid any kind of rewriting of an application at either end.

In the present big data era, the very first challenge is to collect the data as it is a huge amount of data and the second challenge is to analyze it. This analysis typically includes following type of data and much more:

  • User behavior data

  • Application performance tracing

  • Activity data in the form of logs

  • Event messages

Message publishing is a mechanism for connecting various applications with the help of messages that are routed between them, for example, by a message broker such as Kafka. Kafka is a solution to the real-time problems of any software solution, that is, to deal with real-time volumes of information and route it to multiple consumers quickly. Kafka provides seamless integration between information of producers and consumers without blocking the producers of the information, and without letting producers know who the final consumers are.

Apache Kafka is an open source, distributed publish-subscribe messaging system, mainly designed with the following characteristics:

  • Persistent messaging: To derive the real value from big data, any kind of information loss cannot be afforded. Apache Kafka is designed with O(1) disk structures that provide constant-time performance even with very large volumes of stored messages, which is in order of TB.

  • High throughput: Keeping big data in mind, Kafka is designed to work on commodity hardware and to support millions of messages per second.

  • Distributed: Apache Kafka explicitly supports messages partitioning over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics.

  • Multiple client support: Apache Kafka system supports easy integration of clients from different platforms such as Java, .NET, PHP, Ruby, and Python.

  • Real time: Messages produced by the producer threads should be immediately visible to consumer threads; this feature is critical to event-based systems such as Complex Event Processing (CEP) systems.

Kafka provides a real-time publish-subscribe solution, which overcomes the challenges of real-time data usage for consumption, for data volumes that may grow in order of magnitude, larger that the real data. Kafka also supports parallel data loading in the Hadoop systems.

The following diagram shows a typical big data aggregation-and-analysis scenario supported by the Apache Kafka messaging system:

At the production side, there are different kinds of producers, such as the following:

  • Frontend web applications generating application logs

  • Producer proxies generating web analytics logs

  • Producer adapters generating transformation logs

  • Producer services generating invocation trace logs

At the consumption side, there are different kinds of consumers, such as the following:

  • Offline consumers that are consuming messages and storing them in Hadoop or traditional data warehouse for offline analysis

  • Near real-time consumers that are consuming messages and storing them in any NoSQL datastore such as HBase or Cassandra for near real-time analytics

  • Real-time consumers that filter messages in the in-memory database and trigger alert events for related groups


Need for Kafka

A large amount of data is generated by companies having any form of web-based presence and activity. Data is one of the newer ingredients in these Internet-based systems and typically includes user-activity events corresponding to logins, page visits, clicks, social networking activities such as likes, sharing, and comments, and operational and system metrics. This data is typically handled by logging and traditional log aggregation solutions due to high throughput (millions of messages per second). These traditional solutions are the viable solutions for providing logging data to an offline analysis system such as Hadoop. However, the solutions are very limiting for building real-time processing systems.

According to the new trends in Internet applications, activity data has become a part of production data and is used to run analytics at real time. These analytics can be:

  • Search based on relevance

  • Recommendations based on popularity, co-occurrence, or sentimental analysis

  • Delivering advertisements to the masses

  • Internet application security from spam or unauthorized data scraping

Real-time usage of these multiple sets of data collected from production systems has become a challenge because of the volume of data collected and processed.

Apache Kafka aims to unify offline and online processing by providing a mechanism for parallel load in Hadoop systems as well as the ability to partition real-time consumption over a cluster of machines. Kafka can be compared with Scribe or Flume as it is useful for processing activity stream data; but from the architecture perspective, it is closer to traditional messaging systems such as ActiveMQ or RabitMQ.


Few Kafka usages

Some of the companies that are using Apache Kafka in their respective use cases are as follows:

  • LinkedIn (www.linkedin.com): Apache Kafka is used at LinkedIn for the streaming of activity data and operational metrics. This data powers various products such as LinkedIn news feed and LinkedIn Today in addition to offline analytics systems such as Hadoop.

  • DataSift (www.datasift.com/): At DataSift, Kafka is used as a collector for monitoring events and as a tracker of users' consumption of data streams in real time.

  • Twitter (www.twitter.com/): Twitter uses Kafka as a part of its Storm— a stream-processing infrastructure.

  • Foursquare (www.foursquare.com/): Kafka powers online-to-online and online-to-offline messaging at Foursquare. It is used to integrate Foursquare monitoring and production systems with Foursquare, Hadoop-based offline infrastructures.

  • Square (www.squareup.com/): Square uses Kafka as a bus to move all system events through Square's various datacenters. This includes metrics, logs, custom events, and so on. On the consumer side, it outputs into Splunk, Graphite, or Esper-like real-time alerting.


The source of the above information is https://cwiki.apache.org/confluence/display/KAFKA/Powered+By.



In this chapter, we have seen how companies are evolving the mechanism of collecting and processing application-generated data, and that of utilizing the real power of this data by running analytics over it.

In the next chapter we will look at the steps required to install Kafka.

About the Author

  • Nishant Garg

    Nishant Garg has over 17 years' software architecture and development experience in various technologies, such as Java Enterprise Edition, SOA, Spring, Hadoop, Hive, Flume, Sqoop, Oozie, Spark, Shark, YARN, Impala, Kafka, Storm, Solr/Lucene, NoSQL databases (such as HBase, Cassandra, and MongoDB), and MPP databases (such as GreenPlum). He received his MS in software systems from the Birla Institute of Technology and Science, Pilani, India, and is currently working as a technical architect for the Big Data RandD Group with Impetus Infotech Pvt. Ltd. Previously, Nishant has enjoyed working with some of the most recognizable names in IT services and financial industries, employing full software life cycle methodologies such as Agile and SCRUM. Nishant has also undertaken many speaking engagements on big data technologies and is also the author of Apache Kafka and HBase Essentials, Packt Publishing.

    Browse publications by this author
Book Title
Access this book and the full library for just $5/m.
Access now