You're reading from Mastering Apache Storm

Product type: Book
Published in: Aug 2017
Reading level: Expert
ISBN-13: 9781787125636
Edition: 1st Edition

Author: Ankit Jain

Ankit Jain holds a bachelor's degree in computer science and engineering. He has six years' experience in designing and architecting solutions for the big data domain and has been involved with several complex engagements. His technical strengths include Hadoop, Storm, S4, HBase, Hive, Sqoop, Flume, Elasticsearch, machine learning, Kafka, Spring, Java, and J2EE. He also shares his thoughts on his personal blog. You can follow him on Twitter at @mynameisanky. He spends most of his time reading books and playing with different technologies. When not at work, he spends time with his family and friends watching movies and playing games.

Chapter 11. Apache Log Processing with Storm

In the previous chapter, we covered how to integrate Storm with Redis, HBase, Esper, and Elasticsearch.

In this chapter, we will cover the most popular use case of Storm: log processing.

This chapter covers the following major sections:

  • Apache log processing elements
  • Installation of Logstash
  • Configuring Logstash to produce the Apache log into Kafka
  • Splitting the Apache log file
  • Calculating the country name, operating system type, and browser type
  • Identifying the search keywords of your website
  • Persisting the processed data
  • Kafka spout and defining the topology
  • Deploying the topology
  • Storing the data into Elasticsearch and reporting

Apache log processing elements


Log processing is becoming a necessity for every organization, as they need to collect business information from log data. In this chapter, we focus on how to process Apache log data using Logstash, Kafka, Storm, and Elasticsearch to extract that business information.

The following diagram illustrates all the elements that we are developing in this chapter:

Figure 11.1: Log processing topology

Producing Apache log in Kafka using Logstash


As explained in Chapter 8, Integration of Storm and Kafka, Kafka is a distributed messaging queue that integrates very well with Storm. In this section, we will show you how to use Logstash to read the Apache log file and publish it to the Kafka cluster. We assume you already have a Kafka cluster running; the installation steps are outlined in Chapter 8, Integration of Storm and Kafka.

Installation of Logstash

Before moving on to the installation of Logstash, we are going to answer the questions: What is Logstash? Why are we using Logstash?

What is Logstash?

Logstash is a tool used to collect, filter/parse, and emit data for later use. A Logstash configuration is divided into three corresponding sections, called input, filter, and output:

  • The input section is used to read the data from external sources. The common input sources are File, TCP port, Kafka, and so on.
  • The filter section is used to parse the...
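
To make the input/output split concrete, here is a minimal sketch of a logstash.conf that tails an Apache access log and publishes each line to Kafka. The log file path and broker address are assumptions (adjust them for your machine); the apache_log topic name is the one used later in the chapter, and the option names match the Logstash 5.x file input and kafka output plugins:

```conf
# Hypothetical logstash.conf -- paths and broker address are assumptions.
input {
  file {
    # Read the Apache access log; "beginning" replays existing lines too.
    path => "/var/log/httpd/access_log"
    start_position => "beginning"
  }
}
output {
  kafka {
    # Publish each log line as a message on the apache_log topic.
    bootstrap_servers => "localhost:9092"
    topic_id => "apache_log"
  }
}
```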

Splitting the Apache log line


Now we will create a new topology that reads the data from Kafka using the KafkaSpout spout. In this section, we will write an ApacheLogSplitter bolt that contains the logic to extract information such as the IP, status code, referrer, and bytes sent from each Apache log line. As this is a new topology, we must first create a new project.

  1. Create a new Maven project with groupId as com.stormadvance and artifactId as logprocessing.
  2. Add the following dependencies in the pom.xml file:
       <dependency> 
             <groupId>org.apache.storm</groupId> 
             <artifactId>storm-core</artifactId> 
             <version>1.0.2</version> 
             <scope>provided</scope> 
       </dependency> 
 
       <!-- Utilities --> 
       <dependency> 
             <groupId>commons-collections</groupId> 
             <artifactId>commons-collections</artifactId> 
        ...
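
The heart of an ApacheLogSplitter-style bolt is a regular expression over the Apache combined log format. The following standalone sketch shows that parsing step; the class name, regex, and field ordering are my assumptions for illustration, not the book's exact code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ApacheLogParser {
    // Apache combined log format:
    // ip - - [timestamp] "request" status bytes "referrer" "user-agent"
    private static final Pattern LOG_PATTERN = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"");

    public static String[] parse(String line) {
        Matcher m = LOG_PATTERN.matcher(line);
        if (!m.find()) {
            return null; // line does not match the combined format
        }
        // ip, dateTime, request, status, bytesSent, referrer, userAgent
        return new String[] { m.group(1), m.group(2), m.group(3),
                              m.group(4), m.group(5), m.group(6), m.group(7) };
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - - [10/Oct/2016:13:55:36 -0700] "
            + "\"GET /index.html HTTP/1.1\" 200 2326 "
            + "\"https://www.google.co.in/#q=learning+storm\" \"Mozilla/5.0\"";
        String[] fields = parse(line);
        System.out.println(fields[0] + " " + fields[3]);
    }
}
```

Inside the real bolt, these fields would be emitted as a tuple for the downstream bolts described in the next sections.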

Identifying country, operating system type, and browser type from the log file


This section explains how we can calculate the user country name, operation system type, and browser type by analyzing the Apache log line. By identifying the country name, we can easily identify the location where our site is getting more attention and the location where we are getting less attention. Let's perform the following steps to calculate the country name, operating system, and browser from the Apache log file:

  1. We are using the open source geoip library to calculate the country name from the IP address. Add the following dependencies in the pom.xml file:
       <dependency> 
             <groupId>org.geomind</groupId> 
             <artifactId>geoip</artifactId> 
             <version>1.2.8</version> 
       </dependency> 
  2. Add the following repository to the pom.xml file:
        <repository> 
             <id>geoip</id> 
             <...
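
While the geoip library resolves the country from the IP address, the operating system and browser come from analyzing the user-agent string. A minimal sketch of that idea using plain substring checks follows; the class and method names are my assumptions, and a production system would likely use a dedicated user-agent parsing library:

```java
public class UserAgentAnalyzer {
    // Rough substring checks over the user-agent string.
    public static String getOs(String userAgent) {
        if (userAgent.contains("Windows")) return "Windows";
        if (userAgent.contains("Mac OS X")) return "Mac OS X";
        if (userAgent.contains("Linux")) return "Linux";
        return "Other";
    }

    public static String getBrowser(String userAgent) {
        // Chrome must be checked before Safari: Chrome user-agent
        // strings also contain the token "Safari".
        if (userAgent.contains("Chrome")) return "Chrome";
        if (userAgent.contains("Firefox")) return "Firefox";
        if (userAgent.contains("Safari")) return "Safari";
        return "Other";
    }
}
```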

Calculate the search keyword


This section explains how we can calculate the search keyword from the referrer URL. Suppose a referrer URL is https://www.google.co.in/#q=learning+storm. We will pass this referrer URL to a class, and the output of the class will be learning storm. By identifying the search keywords, we can easily identify which keywords users are searching for to reach our site. Let's perform the following steps to calculate the keywords from the referrer URL:

  1. We are creating a KeywordGenerator class in the com.stormadvance.logprocessing package. This class contains logic to generate the search keyword from the referrer URL. The following is the source code of the KeywordGenerator class:
/** 
 * This class takes a referrer URL as input, analyzes the URL, and returns the 
 * search keyword as output. 
 *  
 */ 
public class KeywordGenerator { 
 public String getKeyword(String referer) { 
 
       String[] temp; 
       Pattern pat = Pattern.compile("[?&#]q=([^&]+)"); 
       Matcher m...
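
The listing above is truncated. A complete version of the same idea can be sketched as follows; this is a reconstruction based on the visible regex, not the book's exact code, and the URL-decoding step is my assumption:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordGenerator {
    public String getKeyword(String referer) {
        // Match a q= query parameter after '?', '&', or '#'.
        Pattern pat = Pattern.compile("[?&#]q=([^&]+)");
        Matcher m = pat.matcher(referer);
        if (m.find()) {
            try {
                // Decode '+' and %XX escapes into plain text,
                // e.g. "learning+storm" -> "learning storm".
                return URLDecoder.decode(m.group(1), "UTF-8");
            } catch (UnsupportedEncodingException e) {
                return m.group(1);
            }
        }
        return null; // no search keyword in this referrer
    }

    public static void main(String[] args) {
        KeywordGenerator kg = new KeywordGenerator();
        System.out.println(kg.getKeyword("https://www.google.co.in/#q=learning+storm"));
    }
}
```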

Persisting the processed data


This section will explain how we can persist the processed data into a data store. We are using MySQL as the data store for the log processing use case. I am assuming you already have MySQL installed on your CentOS machine; alternatively, you can follow the blog at http://www.rackspace.com/knowledge_center/article/installing-mysql-server-on-centos to install MySQL on CentOS. Let's perform the following steps to persist the records into MySQL:

  1. Add the following dependency to pom.xml:
 
       <dependency> 
             <groupId>mysql</groupId> 
             <artifactId>mysql-connector-java</artifactId> 
             <version>5.1.6</version> 
       </dependency> 
  2. We are creating a MySQLConnection class in the com.stormadvance.logprocessing package. This class contains a getMySQLConnection(String ip, String database, String user, String password) method, which returns a MySQL connection. The following is the source code of the...
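
The class can be sketched with plain JDBC as follows. The buildUrl helper and the default port 3306 are my assumptions, not the book's exact code:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class MySQLConnection {
    // Build the JDBC URL; 3306 is MySQL's default port (assumption).
    static String buildUrl(String ip, String database) {
        return "jdbc:mysql://" + ip + ":3306/" + database;
    }

    // Returns a live connection; requires mysql-connector-java on the
    // classpath and a reachable MySQL server.
    public static Connection getMySQLConnection(String ip, String database,
                                                String user, String password)
            throws SQLException {
        return DriverManager.getConnection(buildUrl(ip, database), user, password);
    }
}
```

The persistence bolt would obtain a connection once in its prepare() method and reuse it for inserts, rather than connecting per tuple.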

Kafka spout and defining the topology


This section will explain how we can read the Apache log from a Kafka topic. This section also defines the LogProcessingTopology that will chain together all the bolts created in the preceding sections. Let's perform the following steps to consume the data from Kafka and define the topology:

  1. Add the following dependency and repository for Kafka in the pom.xml file:
       <dependency> 
             <groupId>org.apache.storm</groupId> 
             <artifactId>storm-kafka</artifactId> 
             <version>1.0.2</version> 
             <exclusions> 
                   <exclusion> 
                         <groupId>org.apache.kafka</groupId> 
                         <artifactId>kafka-clients</artifactId> 
                   </exclusion> 
             </exclusions> 
       </dependency> 
 
       <dependency> 
             <groupId>org.apache.kafka<...

Deploying the topology


This section will explain how we can deploy the LogProcessingTopology. Perform the following steps:

  1. Execute the following command on the MySQL console to define the database schema:
mysql> create database apachelog; 
mysql> use apachelog; 
mysql> create table apachelog( 
       id INT NOT NULL AUTO_INCREMENT, 
       ip VARCHAR(100) NOT NULL, 
       dateTime VARCHAR(200) NOT NULL, 
       request VARCHAR(100) NOT NULL, 
       response VARCHAR(200) NOT NULL, 
       bytesSent VARCHAR(200) NOT NULL, 
        referrer VARCHAR(500) NOT NULL, 
       useragent VARCHAR(500) NOT NULL, 
       country VARCHAR(200) NOT NULL, 
       browser VARCHAR(200) NOT NULL, 
       os VARCHAR(200) NOT NULL, 
       keyword VARCHAR(200) NOT NULL, 
       PRIMARY KEY (id) 
 );
  2. I am assuming you have already produced some data on the apache_log topic by using Logstash.
  3. Go to the project home directory and run the following command to build the project:
> mvn clean install -DskipTests 
  4. Execute...

MySQL queries


This section will explain how we can analyze or query the stored data to generate some statistics. We will cover the following:

  • Calculating the page hits from each country
  • Calculating the count of each browser
  • Calculating the count of each operating system

Calculating the page hits from each country

Run the following command on the MySQL console to calculate the page hits from each country:

mysql> select country, count(*) from apachelog group by country; 
+---------------------------+----------+ 
| country                   | count(*) | 
+---------------------------+----------+ 
| Asia/Pacific Region       |        9 | 
| Belarus                   |       12 | 
| Belgium                   |       12 | 
| Bosnia and Herzegovina    |       12 | 
| Brazil                    |       36 | 
| Bulgaria                  |       12 | 
| Canada                    |      218 | 
| Europe                    |       24 | 
| France                    |       44 | 
| Germany                   |    ...

Summary


In this chapter, we showed how to process an Apache log file: how to identify the country name from the IP address, how to identify the user's operating system and browser by analyzing the log line, and how to identify the search keyword by analyzing the referrer field.

In the next chapter, we will learn how we can solve machine learning problems through Storm.
