You're reading from Mastering Apache Storm

Product type: Book
Published in: Aug 2017
Reading level: Expert
ISBN-13: 9781787125636
Edition: 1st Edition

Author: Ankit Jain

Ankit Jain holds a bachelor's degree in computer science and engineering. He has six years' experience in designing and architecting solutions for the big data domain and has been involved with several complex engagements. His technical strengths include Hadoop, Storm, S4, HBase, Hive, Sqoop, Flume, Elasticsearch, machine learning, Kafka, Spring, Java, and J2EE. He also shares his thoughts on his personal blog. You can follow him on Twitter at @mynameisanky. He spends most of his time reading books and playing with different technologies. When not at work, he spends time with his family and friends watching movies and playing games.

Chapter 11. Apache Log Processing with Storm

In the previous chapter, we covered how to integrate Storm with Redis, HBase, Esper, and Elasticsearch.

In this chapter, we will cover the most popular use case of Storm: log processing.

This chapter covers the following major sections:

  • Apache log processing elements
  • Installation of Logstash
  • Configuring Logstash to produce the Apache log into Kafka
  • Splitting the Apache log file
  • Calculating the country name, operating system type, and browser type
  • Identifying the search keywords of your website
  • Persisting the processed data
  • Kafka spout and defining the topology
  • Deploying the topology
  • Storing the data into Elasticsearch and reporting

Apache log processing elements


Log processing is becoming a necessity for every organization, as they need to collect business information from log data. In this chapter, we focus on how to process Apache log data using Logstash, Kafka, Storm, and Elasticsearch to extract that business information.

The following diagram illustrates all the elements that we are developing in this chapter:

Figure 11.1: Log processing topology

Producing Apache log in Kafka using Logstash


As explained in Chapter 8, Integration of Storm and Kafka, Kafka is a distributed messaging queue that integrates very well with Storm. In this section, we will show you how to use Logstash to read the Apache log file and publish it to the Kafka cluster. We assume you already have a Kafka cluster running; the installation steps are outlined in Chapter 8, Integration of Storm and Kafka.

Installation of Logstash

Before moving on to the installation of Logstash, we are going to answer the questions: What is Logstash? Why are we using Logstash?

What is Logstash?

Logstash is a tool used to collect, filter/parse, and emit data for later use. A Logstash configuration is divided into three corresponding sections, called input, filter, and output:

  • The input section is used to read the data from external sources. The common input sources are File, TCP port, Kafka, and so on.
  • The filter section is used to parse the...
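
To make the input/output split concrete, here is a minimal sketch of a logstash.conf that tails an Apache access log and publishes each line to Kafka. The log file path and broker address are assumptions (adjust them for your machine); the apache_log topic name is the one used later in the chapter, and the option names match the Logstash 5.x file input and kafka output plugins:

```conf
# Hypothetical logstash.conf -- paths and broker address are assumptions.
input {
  file {
    # Read the Apache access log; "beginning" replays existing lines too.
    path => "/var/log/httpd/access_log"
    start_position => "beginning"
  }
}
output {
  kafka {
    # Publish each log line as a message on the apache_log topic.
    bootstrap_servers => "localhost:9092"
    topic_id => "apache_log"
  }
}
```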

Splitting the Apache log line


Now we will create a new topology that reads the data from Kafka using the KafkaSpout spout. In this section, we will write an ApacheLogSplitter bolt that contains the logic to extract information such as the IP, status code, referrer, and bytes sent from each Apache log line. As this is a new topology, we must first create a new project.

  1. Create a new Maven project with groupId as com.stormadvance and artifactId as logprocessing.
  2. Add the following dependencies in the pom.xml file:
       <dependency> 
             <groupId>org.apache.storm</groupId> 
             <artifactId>storm-core</artifactId> 
             <version>1.0.2</version> 
             <scope>provided</scope> 
       </dependency> 
 
       <!-- Utilities --> 
       <dependency> 
             <groupId>commons-collections</groupId> 
             <artifactId>commons-collections</artifactId> 
        ...
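
The heart of an ApacheLogSplitter-style bolt is a regular expression over the Apache combined log format. The following standalone sketch shows that parsing step; the class name, regex, and field ordering are my assumptions for illustration, not the book's exact code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ApacheLogParser {
    // Apache combined log format:
    // ip - - [timestamp] "request" status bytes "referrer" "user-agent"
    private static final Pattern LOG_PATTERN = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"");

    public static String[] parse(String line) {
        Matcher m = LOG_PATTERN.matcher(line);
        if (!m.find()) {
            return null; // line does not match the combined format
        }
        // ip, dateTime, request, status, bytesSent, referrer, userAgent
        return new String[] { m.group(1), m.group(2), m.group(3),
                              m.group(4), m.group(5), m.group(6), m.group(7) };
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - - [10/Oct/2016:13:55:36 -0700] "
            + "\"GET /index.html HTTP/1.1\" 200 2326 "
            + "\"https://www.google.co.in/#q=learning+storm\" \"Mozilla/5.0\"";
        String[] fields = parse(line);
        System.out.println(fields[0] + " " + fields[3]);
    }
}
```

Inside the real bolt, these fields would be emitted as a tuple for the downstream bolts described in the next sections.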

Identifying country, operating system type, and browser type from the log file


This section explains how we can calculate the user country name, operation system type, and browser type by analyzing the Apache log line. By identifying the country name, we can easily identify the location where our site is getting more attention and the location where we are getting less attention. Let's perform the following steps to calculate the country name, operating system, and browser from the Apache log file:

  1. We are using the open source geoip library to calculate the country name from the IP address. Add the following dependencies in the pom.xml file:
       <dependency> 
             <groupId>org.geomind</groupId> 
             <artifactId>geoip</artifactId> 
             <version>1.2.8</version> 
       </dependency> 
  2. Add the following repository to the pom.xml file:
        <repository> 
             <id>geoip</id> 
             <...
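
While the geoip library resolves the country from the IP address, the operating system and browser come from analyzing the user-agent string. A minimal sketch of that idea using plain substring checks follows; the class and method names are my assumptions, and a production system would likely use a dedicated user-agent parsing library:

```java
public class UserAgentAnalyzer {
    // Rough substring checks over the user-agent string.
    public static String getOs(String userAgent) {
        if (userAgent.contains("Windows")) return "Windows";
        if (userAgent.contains("Mac OS X")) return "Mac OS X";
        if (userAgent.contains("Linux")) return "Linux";
        return "Other";
    }

    public static String getBrowser(String userAgent) {
        // Chrome must be checked before Safari: Chrome user-agent
        // strings also contain the token "Safari".
        if (userAgent.contains("Chrome")) return "Chrome";
        if (userAgent.contains("Firefox")) return "Firefox";
        if (userAgent.contains("Safari")) return "Safari";
        return "Other";
    }
}
```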

Calculate the search keyword


This section explains how we can calculate the search keyword from the referrer URL. Suppose a referrer URL is https://www.google.co.in/#q=learning+storm. We will pass this referrer URL to a class, and the output of the class will be learning storm. By identifying the search keywords, we can easily identify which keywords users are searching for to reach our site. Let's perform the following steps to calculate the keywords from the referrer URL:

  1. We are creating a KeywordGenerator class in the com.stormadvance.logprocessing package. This class contains logic to generate the search keyword from the referrer URL. The following is the source code of the KeywordGenerator class:
/** 
 * This class takes a referrer URL as input, analyzes the URL, and returns the 
 * search keyword as output. 
 *  
 */ 
public class KeywordGenerator { 
 public String getKeyword(String referer) { 
 
       String[] temp; 
       Pattern pat = Pattern.compile("[?&#]q=([^&]+)"); 
       Matcher m...
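
The listing above is truncated. A complete version of the same idea can be sketched as follows; this is a reconstruction based on the visible regex, not the book's exact code, and the URL-decoding step is my assumption:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordGenerator {
    public String getKeyword(String referer) {
        // Match a q= query parameter after '?', '&', or '#'.
        Pattern pat = Pattern.compile("[?&#]q=([^&]+)");
        Matcher m = pat.matcher(referer);
        if (m.find()) {
            try {
                // Decode '+' and %XX escapes into plain text,
                // e.g. "learning+storm" -> "learning storm".
                return URLDecoder.decode(m.group(1), "UTF-8");
            } catch (UnsupportedEncodingException e) {
                return m.group(1);
            }
        }
        return null; // no search keyword in this referrer
    }

    public static void main(String[] args) {
        KeywordGenerator kg = new KeywordGenerator();
        System.out.println(kg.getKeyword("https://www.google.co.in/#q=learning+storm"));
    }
}
```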

Persisting the processed data


This section will explain how we can persist the processed data into a data store. We are using MySQL as the data store for the log processing use case. I am assuming you already have MySQL installed on your CentOS machine; alternatively, you can follow the blog at http://www.rackspace.com/knowledge_center/article/installing-mysql-server-on-centos to install MySQL on CentOS. Let's perform the following steps to persist the records into MySQL:

  1. Add the following dependency to pom.xml:
 
       <dependency> 
             <groupId>mysql</groupId> 
             <artifactId>mysql-connector-java</artifactId> 
             <version>5.1.6</version> 
       </dependency> 
  2. We are creating a MySQLConnection class in the com.stormadvance.logprocessing package. This class contains a getMySQLConnection(String ip, String database, String user, String password) method, which returns a MySQL connection. The following is the source code of the...
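
The class can be sketched with plain JDBC as follows. The buildUrl helper and the default port 3306 are my assumptions, not the book's exact code:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class MySQLConnection {
    // Build the JDBC URL; 3306 is MySQL's default port (assumption).
    static String buildUrl(String ip, String database) {
        return "jdbc:mysql://" + ip + ":3306/" + database;
    }

    // Returns a live connection; requires mysql-connector-java on the
    // classpath and a reachable MySQL server.
    public static Connection getMySQLConnection(String ip, String database,
                                                String user, String password)
            throws SQLException {
        return DriverManager.getConnection(buildUrl(ip, database), user, password);
    }
}
```

The persistence bolt would obtain a connection once in its prepare() method and reuse it for inserts, rather than connecting per tuple.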

Kafka spout and defining the topology


This section will explain how we can read the Apache log from a Kafka topic. This section also defines the LogProcessingTopology that will chain together all the bolts created in the preceding sections. Let's perform the following steps to consume the data from Kafka and define the topology:

  1. Add the following dependency and repository for Kafka in the pom.xml file:
       <dependency> 
             <groupId>org.apache.storm</groupId> 
             <artifactId>storm-kafka</artifactId> 
             <version>1.0.2</version> 
             <exclusions> 
                   <exclusion> 
                         <groupId>org.apache.kafka</groupId> 
                         <artifactId>kafka-clients</artifactId> 
                   </exclusion> 
             </exclusions> 
       </dependency> 
 
       <dependency> 
             <groupId>org.apache.kafka<...

Deploying the topology


This section will explain how we can deploy the LogProcessingTopology. Perform the following steps:

  1. Execute the following command on the MySQL console to define the database schema:
mysql> create database apachelog; 
mysql> use apachelog; 
mysql> create table apachelog( 
       id INT NOT NULL AUTO_INCREMENT, 
       ip VARCHAR(100) NOT NULL, 
       dateTime VARCHAR(200) NOT NULL, 
       request VARCHAR(100) NOT NULL, 
       response VARCHAR(200) NOT NULL, 
       bytesSent VARCHAR(200) NOT NULL, 
        referrer VARCHAR(500) NOT NULL, 
       useragent VARCHAR(500) NOT NULL, 
       country VARCHAR(200) NOT NULL, 
       browser VARCHAR(200) NOT NULL, 
       os VARCHAR(200) NOT NULL, 
       keyword VARCHAR(200) NOT NULL, 
       PRIMARY KEY (id) 
 );
  2. I am assuming you have already produced some data on the apache_log topic by using Logstash.
  3. Go to the project home directory and run the following command to build the project:
> mvn clean install -DskipTests 
  4. Execute...

MySQL queries


This section will explain how we can analyze or query the stored data to generate some statistics. We will cover the following:

  • Calculating the page hits from each country
  • Calculating the count of each browser
  • Calculating the count of each operating system

Calculating the page hits from each country

Run the following command on the MySQL console to calculate the page hits from each country:

mysql> select country, count(*) from apachelog group by country; 
+---------------------------+----------+ 
| country                   | count(*) | 
+---------------------------+----------+ 
| Asia/Pacific Region       |        9 | 
| Belarus                   |       12 | 
| Belgium                   |       12 | 
| Bosnia and Herzegovina    |       12 | 
| Brazil                    |       36 | 
| Bulgaria                  |       12 | 
| Canada                    |      218 | 
| Europe                    |       24 | 
| France                    |       44 | 
| Germany                   |    ...

Summary


In this chapter, we showed how to process an Apache log file: how to identify the country name from the IP address, how to identify the user's operating system and browser by analyzing the log line, and how to identify the search keyword by analyzing the referrer field.

In the next chapter, we will learn how we can solve machine learning problems through Storm.
