Elasticsearch for Hadoop

By Vishal Shukla

About this book

Publication date: October 2015
Publisher: Packt
Pages: 222
ISBN: 9781785288999

 

Chapter 1. Setting Up Environment

The goal of this book is to get you up and running with ES-Hadoop and enable you to solve real-world analytics problems. To take the first step towards doing this, we will start with setting up Hadoop, Elasticsearch, and related toolsets, which you will use throughout the rest of the book.

We encourage you to try the examples as you read, to speed up the learning process.

In this chapter, we will cover the following topics:

  • Setting up Hadoop in pseudo-distributed mode

  • Setting up Elasticsearch and its related plugins

  • Running the WordCount example

  • Exploring data in Marvel and Head

 

Setting up Hadoop for Elasticsearch


For our exploration of Hadoop and Elasticsearch, we will use an Ubuntu-based host. However, you may opt to run any other Linux OS and set up Hadoop and Elasticsearch on it.

If you are already a Hadoop user and have Hadoop set up on your local machine, you may jump directly to the section Setting up Elasticsearch.

Hadoop supports three cluster modes: the stand-alone mode, the pseudo-distributed mode, and the fully-distributed mode. To walk through the examples in this book, the pseudo-distributed mode on a Linux operating system is sufficient. Without the complexity of setting up many nodes, this mode mirrors the components in such a way that they behave no differently than they would in a real production environment. In the pseudo-distributed mode, each component runs in its own JVM process.

Setting up Java

The examples in this book are developed and tested against Oracle Java 1.8. These examples should run fine with other distributions of Java 8 as well.

In order to set up Oracle Java 8, open the terminal and execute the following steps:

  1. First, add and update the repository for Java 8 with the following command:

    $ sudo add-apt-repository ppa:webupd8team/java
    $ sudo apt-get update
    
  2. Next, install Java 8 and configure the environment variables, as shown in the following command:

    $ sudo apt-get install oracle-java8-set-default
    
  3. Now, verify the installation as follows:

    $ java -version
    

    This should show an output similar to the following code; it may vary a bit based on the exact version:

    java version "1.8.0_60"
    Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
    Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
    

Setting up a dedicated user

To ensure that our ES-Hadoop environment is clean and isolated from the rest of the applications and to be able to manage security and permissions easily, we will set up a dedicated user. Perform the following steps:

  1. First, add the hadoop group with the following command:

    $ sudo addgroup hadoop
    
  2. Then, add the eshadoop user to the hadoop group, as shown in the following command:

    $ sudo adduser eshadoop hadoop
    
  3. Finally, add the eshadoop user to the sudoers list by adding the user to the sudo group as follows:

    $ sudo adduser eshadoop sudo
    

Now, you need to relogin with the eshadoop user to execute further steps.
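
If you do not want to log out of your current session, one convenient way to switch to the new user is shown below; this is just a sketch, and logging out and back in as eshadoop works equally well:

$ su - eshadoop     # switch to the eshadoop user with a login shell
$ whoami            # should print eshadoop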

Installing SSH and setting up the certificate

In order to manage nodes, Hadoop requires SSH access, so let's install and run SSH. Perform the following steps:

  1. First, install ssh with the following command:

    $ sudo apt-get install ssh 
    
  2. Then, generate a new SSH key pair using the ssh-keygen utility, by using the following command:

    $ ssh-keygen -t rsa -P ''  -C email@example.com
    

    Note

    Use the default location when prompted with Enter file in which to save the key. By default, the key pair will be generated under the /home/eshadoop/.ssh folder.

  3. Now, confirm the key generation by issuing the following command. It should list at least two files, id_rsa and id_rsa.pub. We just created an RSA key pair with an empty passphrase so that Hadoop can interact with its nodes without prompting for one:

    $ ls -l ~/.ssh
    
  4. To enable the SSH access to your local machine, you need to specify that the newly generated public key is an authorized key to log in using the following command:

    $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    
  5. Finally, do not forget to test the password-less ssh using the following command:

    $ ssh localhost
    

Downloading Hadoop

Download Hadoop and extract it to /usr/local so that it is available to other users as well. Perform the following steps:

  1. First, download the Hadoop tarball by running the following command:

    $ wget http://ftp.wayne.edu/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
    
  2. Next, extract the tarball to the /usr/local directory with the following command:

    $ sudo tar vxzf hadoop-2.6.0.tar.gz -C /usr/local
    

    Note

    Note that extracting it to /usr/local makes the installation available to other users as well, assuming that appropriate permissions are set on the directory.

  3. Now, rename the Hadoop directory using the following command:

    $ cd /usr/local
    $ sudo mv hadoop-2.6.0 hadoop
    
  4. Finally, change the owner of all the files to the eshadoop user and the hadoop group with the following command:

    $ sudo chown -R eshadoop:hadoop hadoop
    

Setting up environment variables

The next step is to set up environment variables. You can do so by exporting the required variables to the .bashrc file for the user.

Open the .bashrc file using any editor of your choice, then add the following export declarations to set up our environment variables:

#Set JAVA_HOME 
export JAVA_HOME=/usr/lib/jvm/java-8-oracle

#Set Hadoop-related environment variables
export HADOOP_INSTALL=/usr/local/hadoop

#Add bin and sbin directory to PATH
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin

#Set a few more Hadoop-related environment variables
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"

Once you have saved the .bashrc file, you can relogin to have your new environment variables visible, or you can source the .bashrc file using the following command:

$ source ~/.bashrc
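
A quick, optional way to confirm that the new variables are in effect (a minimal sketch):

$ echo $HADOOP_INSTALL     # should print /usr/local/hadoop
$ which hadoop             # should resolve to /usr/local/hadoop/bin/hadoop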

Configuring Hadoop

Now, we need to set up the JAVA_HOME environment variable in the hadoop-env.sh file that is used by Hadoop. You can find it in $HADOOP_INSTALL/etc/hadoop.

Next, change the JAVA_HOME path to reflect your Java installation directory. On my machine, it looks similar to the following:

$ export JAVA_HOME=/usr/lib/jvm/java-8-oracle
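
If you prefer to make this change non-interactively, a one-liner along these lines should work (a sketch that assumes the stock hadoop-env.sh shipped with Hadoop 2.6.0 and the Java path used above):

$ sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/java-8-oracle|' $HADOOP_INSTALL/etc/hadoop/hadoop-env.sh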

Now, let's relogin and confirm the configuration using the following command:

$ hadoop version

As you know, we will set up our Hadoop environment in a pseudo-distributed mode. In this mode, each Hadoop daemon runs in a separate Java process. The next step is to configure these daemons. So, let's switch to the following folder that contains all the Hadoop configuration files:

$ cd $HADOOP_INSTALL/etc/hadoop

Configuring core-site.xml

The configuration of core-site.xml will set up the temporary directory for Hadoop and the default filesystem. In our case, the default filesystem refers to the NameNode. Let's change the content of the <configuration> section of core-site.xml so that it looks similar to the following code:

<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/eshadoop/hdfs/tmp</value>
  <description>A base for other temporary directories.</description>
 </property>
<property>
   <name>fs.default.name</name>
   <value>hdfs://localhost:9000</value>
</property>
</configuration>

Configuring hdfs-site.xml

Now, we will configure the replication factor for HDFS files. To set the replication to 1, change the content of the <configuration> section of hdfs-site.xml so that it looks similar to the following code:

<configuration>
   <property>
   <name>dfs.replication</name>
   <value>1</value>
 </property>
</configuration>

Note

We will run Hadoop in the pseudo-distributed mode. In order to do this, we need to configure the YARN resource manager. YARN handles the resource management and scheduling responsibilities in the Hadoop cluster so that the data processing and data storage components can focus on their respective tasks.

Configuring yarn-site.xml

Edit yarn-site.xml to set the auxiliary service name and class, as shown in the following code:

<configuration>
<property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
</property>
<property>
   <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>

Configuring mapred-site.xml

Hadoop provides mapred-site.xml.template, which you can copy to mapred-site.xml (a copy command is sketched after the following snippet) and then change the content of its <configuration> section to the following code; this ensures that MapReduce jobs run on YARN as opposed to running in-process locally:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
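
If you want to create the file from the bundled template on the command line, a minimal sketch (assuming you are still in the $HADOOP_INSTALL/etc/hadoop directory we switched to earlier) is:

$ cp mapred-site.xml.template mapred-site.xml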

Formatting the distributed filesystem

We have now configured all the Hadoop daemons, covering HDFS, YARN, and MapReduce. You may already be aware that HDFS relies on a NameNode and DataNodes. The NameNode contains the storage-related metadata, whereas the DataNodes store the real data in the form of blocks. When you set up a Hadoop cluster, you need to format the NameNode before you can start using HDFS. We can do so with the following command:

$ hadoop namenode -format

Note

If you are already using the DataNodes of an existing HDFS cluster, do not format the NameNode unless you know what you are doing. When you format the NameNode, you lose all the storage metadata that records how the blocks are distributed among the DataNodes. This means that although you didn't physically remove the data from the DataNodes, that data becomes inaccessible to you. Therefore, it is good practice to also remove the data on the DataNodes when you format the NameNode.
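
In our pseudo-distributed setup, both the NameNode metadata and the DataNode blocks live under the hadoop.tmp.dir we configured in core-site.xml, so a full reset could look like the following sketch; double-check the path before running it, because it permanently deletes all HDFS data:

$ stop-dfs.sh                         # stop the HDFS daemons first, if they are running
$ rm -rf /home/eshadoop/hdfs/tmp/*    # wipes the NameNode metadata and DataNode blocks
$ hadoop namenode -format             # reformat the NameNode afterwards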

Starting Hadoop daemons

Now we have all the prerequisites in place and all the Hadoop daemons configured. In order to run our first MapReduce job, we need the required Hadoop daemons to be running.

Let's start with HDFS using the following command. This command starts the NameNode, SecondaryNameNode, and DataNode daemons:

$ start-dfs.sh

The next step is to start the YARN resource manager using the following command (YARN will start the ResourceManager and NodeManager daemons):

$ start-yarn.sh

If the preceding two commands were successful in starting HDFS and YARN, you should be able to check the running daemons using the jps tool (this tool lists the running JVM processes on your machine):

$ jps

If everything worked successfully, you should see the following services running:

13386 SecondaryNameNode
13059 NameNode
13179 DataNode
17490 Jps
13649 NodeManager
13528 ResourceManager
 

Setting up Elasticsearch


In this section, we will download and configure the Elasticsearch server and install the Elasticsearch Head and Marvel plugins.

Downloading Elasticsearch

To download Elasticsearch, perform the following steps:

  1. First, download Elasticsearch using the following command:

    $ wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.7.1.tar.gz
    
  2. Once the file is downloaded, extract it to /usr/local and rename it with a convenient name, using the following command:

    $ sudo tar -xvzf elasticsearch-1.7.1.tar.gz -C /usr/local
    $ sudo mv /usr/local/elasticsearch-1.7.1 /usr/local/elasticsearch
    
  3. Then, set the eshadoop user as the owner of the directory as follows:

    $ sudo chown -R eshadoop:hadoop /usr/local/elasticsearch
    

Configuring Elasticsearch

The Elasticsearch configuration file, elasticsearch.yml, can be located in the config folder under the Elasticsearch home directory. Open the elasticsearch.yml file in the editor of your choice by using the following command:

$ cd /usr/local/elasticsearch
$ vi config/elasticsearch.yml

Uncomment the line with the cluster.name key from the elasticsearch.yml file and change the cluster name, as shown in the following code:

cluster.name: eshadoopcluster

Similarly, uncomment the line with the node.name key and change the value as follows:

node.name: "ES Hadoop Node"

Note

Elasticsearch comes with a decent default configuration to let you start the nodes with zero additional configurations. In a production environment and even in a development environment, sometimes it may be desirable to tweak some configurations.

By default, Elasticsearch assigns the node a name randomly picked from a list of around 3,000 Marvel character names. The default cluster name assigned to the node is elasticsearch. With the default configuration, ES nodes on the same network that share a cluster name will form a cluster and synchronize data between themselves. This may be unwanted if each developer wants an isolated ES server setup. It's always good to specify cluster.name and node.name to avoid unwanted surprises.

You can also change the defaults for the configurations starting with path.*. To set up the directories that store the server data, locate the paths section, then uncomment and change the path settings as shown in the following code:

########################### paths #############################

# Path to directory containing configuration (this file and logging.yml):
#
path.conf: /usr/local/elasticsearch/config

# Path to directory where to store index data allocated for this node.
#
# Can optionally include more than one location, causing data to be striped across
# the locations (a la RAID 0) on a file level, favouring locations with most free
# space on creation.
path.data: /usr/local/elasticsearch/data

# Path to temporary files:
#
path.work: /usr/local/elasticsearch/work

# Path to log files:
#
path.logs: /usr/local/elasticsearch/logs

Note

It's important to choose the location of path.data wisely. In production, you should make sure that this path is not inside the Elasticsearch installation directory, in order to avoid accidentally overwriting or deleting the data when upgrading Elasticsearch.
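
Elasticsearch will normally create these directories on startup if it has permission, but you can also create them up front; a minimal sketch (assuming the ownership we set earlier on /usr/local/elasticsearch) is:

$ mkdir -p /usr/local/elasticsearch/{data,work,logs}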

Installing Elasticsearch's Head plugin

Elasticsearch provides a plugin utility to install the Elasticsearch plugins. Execute the following command to install the Head plugin:

$ bin/plugin -install mobz/elasticsearch-head

-> Installing mobz/elasticsearch-head...
Trying https://github.com/mobz/elasticsearch-head/archive/master.zip...
Downloading .........................................................................DONE
Installed mobz/elasticsearch-head into /usr/local/elasticsearch/plugins/head
Identified as a _site plugin, moving to _site structure ...

As indicated by the console output, the plugin is successfully installed in the default plugins directory under the Elasticsearch home. You can access the head plugin at http://localhost:9200/_plugin/head/.

Installing the Marvel plugin

Now, let's install the Marvel plugin using a similar command:

$ bin/plugin -i elasticsearch/marvel/latest

-> Installing elasticsearch/marvel/latest...
Trying http://download.elasticsearch.org/elasticsearch/marvel/marvel-latest.zip...
Downloading .........................................................................DONE
Installed elasticsearch/marvel/latest into /usr/local/elasticsearch/plugins/marvel

Running and testing

Finally, start Elasticsearch using the following command:

$ ./bin/elasticsearch

We should then see startup logs similar to the following (the exact version numbers and timestamps will vary):

[2015-05-13 21:59:37,344][INFO ][node                     ] [ES Hadoop Node] version[1.5.1], pid[3822], build[5e38401/2015-04-09T13:41:35Z]
[2015-05-13 21:59:37,346][INFO ][node                     ] [ES Hadoop Node] initializing ...
[2015-05-13 21:59:37,358][INFO ][plugins                  ] [ES Hadoop Node] loaded [marvel], sites [marvel, head]
[2015-05-13 21:59:39,956][INFO ][node                     ] [ES Hadoop Node] initialized
[2015-05-13 21:59:39,959][INFO ][node                     ] [ES Hadoop Node] starting ...
[2015-05-13 21:59:40,133][INFO ][transport                ] [ES Hadoop Node] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.0.2.15:9300]}
[2015-05-13 21:59:40,159][INFO ][discovery                ] [ES Hadoop Node] eshadoopcluster/_bzqXWbLSXKXWpafHaLyRA
[2015-05-13 21:59:43,941][INFO ][cluster.service          ] [ES Hadoop Node] new_master [ES Hadoop Node][_bzqXWbLSXKXWpafHaLyRA][eshadoop][inet[/10.0.2.15:9300]], reason: zen-disco-join (elected_as_master)
[2015-05-13 21:59:43,989][INFO ][http                     ] [ES Hadoop Node] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.0.2.15:9200]}
[2015-05-13 21:59:43,989][INFO ][node                     ] [ES Hadoop Node] started
[2015-05-13 21:59:44,026][INFO ][gateway                  ] [ES Hadoop Node] recovered [0] indices into cluster_state
[2015-05-13 22:00:00,707][INFO ][cluster.metadata         ] [ES Hadoop Node] [.marvel-2015.05.13] creating index, cause [auto(bulk api)], templates [marvel], shards [1]/[1], mappings [indices_stats, cluster_stats, node_stats, shard_event, node_event, index_event, index_stats, _default_, cluster_state, cluster_event, routing_event]
[2015-05-13 22:00:01,421][INFO ][cluster.metadata         ] [ES Hadoop Node] [.marvel-2015.05.13] update_mapping [node_stats] (dynamic)

The startup logs give you some useful hints as to what is going on. By default, Elasticsearch uses ports 9200 to 9299 for HTTP, allocating the first available port in that range to the node. In the highlighted output, you can also see that the node binds to port 9300. Elasticsearch uses the port range 9300 to 9399 for internal node-to-node communication and for communication with the Java client. It can use zen multicast or unicast ping discovery to find other nodes in the cluster, with multicast being the default. We will learn more about these discovery modes in later chapters.
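
With the node running, a quick sanity check against the HTTP port from another terminal (a minimal sketch) looks like this:

$ curl -s http://localhost:9200/

The JSON response should report the node name, the cluster_name (eshadoopcluster) we configured earlier, and the version details.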

 

Running the WordCount example


Now that we have our ES-Hadoop environment set up and tested, we are all set to run our first WordCount example. In the Hadoop world, WordCount has taken the place of the classic HelloWorld program, hasn't it?

Getting the examples and building the job JAR file

You can download the examples in the book from https://github.com/vishalbrevitaz/eshadoop/tree/master/ch01. Once you have got the source code, you can build the JAR file for this chapter using the steps mentioned in the readme file in the source code zip. The build process should generate a ch01-0.0.1-job.jar file under the <SOURCE_CODE_BASE_DIR>/ch01/target directory.
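
The authoritative build steps are in the readme, but if the project follows a standard Maven layout, as the target directory and the -job.jar suffix suggest (an assumption on my part; defer to the readme), the build typically boils down to:

$ cd <SOURCE_CODE_BASE_DIR>/ch01
$ mvn clean package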

Importing the test file to HDFS

For our WordCount example, you can use any text file of your choice. To explain the example, we will use the sample.txt file that is part of the source zip. Perform the following steps:

  1. First, let's create a nice directory structure in HDFS to manage our input files with the following command:

    $ hadoop fs -mkdir /input
    $ hadoop fs -mkdir /input/ch01
    
  2. Next, upload the sample.txt file to HDFS at the desired location, by using the following command:

    $ hadoop fs -put data/ch01/sample.txt /input/ch01/sample.txt 
    
  3. Now, verify that the file is successfully imported to HDFS by using the following command:

    $ hadoop fs -ls /input/ch01
    

    Finally, when you execute the preceding command, it should show an output similar to the following code:

    Found 1 items
    -rw-r--r--   1 eshadoop supergroup       2803 2015-05-10 15:18 /input/ch01/sample.txt 
    

Running our first job

We now have the job JAR file built and the sample file imported to HDFS. Point your terminal to the <SOURCE_CODE_BASE_DIR>/ch01/target directory and run the following command:

$ hadoop jar ch01-0.0.1-job.jar /input/ch01/sample.txt  

Now you'll get the following output:

15/05/10 15:21:33 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/05/10 15:21:34 WARN mr.EsOutputFormat: Speculative execution enabled for reducer - consider disabling it to prevent data corruption
15/05/10 15:21:34 INFO util.Version: Elasticsearch Hadoop v2.0.2 [ca81ff6732]
15/05/10 15:21:34 INFO mr.EsOutputFormat: Writing to [eshadoop/wordcount]
15/05/10 15:21:35 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/05/10 15:21:41 INFO input.FileInputFormat: Total input paths to process : 1
15/05/10 15:21:42 INFO mapreduce.JobSubmitter: number of splits:1
15/05/10 15:21:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1431251282365_0002
15/05/10 15:21:42 INFO impl.YarnClientImpl: Submitted application application_1431251282365_0002
15/05/10 15:21:42 INFO mapreduce.Job: The url to track the job: http://eshadoop:8088/proxy/application_1431251282365_0002/
15/05/10 15:21:42 INFO mapreduce.Job: Running job: job_1431251282365_0002
15/05/10 15:21:54 INFO mapreduce.Job: Job job_1431251282365_0002 running in uber mode : false
15/05/10 15:21:54 INFO mapreduce.Job:  map 0% reduce 0%
15/05/10 15:22:01 INFO mapreduce.Job:  map 100% reduce 0%
15/05/10 15:22:09 INFO mapreduce.Job:  map 100% reduce 100%
15/05/10 15:22:10 INFO mapreduce.Job: Job job_1431251282365_0002 completed successfully



  Elasticsearch Hadoop Counters
    Bulk Retries=0
    Bulk Retries Total Time(ms)=0
    Bulk Total=1
    Bulk Total Time(ms)=48
    Bytes Accepted=9655
    Bytes Received=4000
    Bytes Retried=0
    Bytes Sent=9655
    Documents Accepted=232
    Documents Received=0
    Documents Retried=0
    Documents Sent=232
    Network Retries=0
    Network Total Time(ms)=84
    Node Retries=0
    Scroll Total=0
    Scroll Total Time(ms)=0

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

We just executed our first Hadoop MapReduce job that uses ES-Hadoop to import data into Elasticsearch. This MapReduce job simply emits the count of each word in the Mapper phase, and the Reducer calculates the sum of all the counts for each word. We will dig into the details of how exactly this WordCount program is developed in the next chapter. The console output of the job displays useful log information to indicate the progress of the job execution. It also displays the ES-Hadoop counters, which provide some handy information about the amount of data and the number of documents being sent and received, the number of retries, the time taken, and so on. If you have used the sample.txt file provided in the source zip, you will see that the job found 232 unique words, all of which are pushed as Elasticsearch documents.

In the next section, we will examine these documents with the Elasticsearch Head and Marvel plugins that we installed earlier. Note that you can also track the status of your ES-Hadoop MapReduce jobs, just like any other Hadoop job, in the YARN ResourceManager web UI. In our setup, you can access it at http://localhost:8088/cluster.
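
To cross-check the document count from the command line (a sketch that assumes the eshadoop index name shown in the job log), you can query the count API:

$ curl -s 'http://localhost:9200/eshadoop/_count?pretty'

The count field in the response should match the 232 documents reported by the ES-Hadoop counters.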

 

Exploring data in Head and Marvel


In the previous section, we set up a couple of plugins: Head and Marvel. In this section, we will have a bird's eye view of how to use these plugins to explore the Elasticsearch documents that we just imported by running the ES-Hadoop MapReduce job.

Viewing data in Head

The Elasticsearch Head plugin provides a simple web frontend to visualize the Elasticsearch indices, cluster and node health, and statistics. It provides an easy-to-use interface to explore indices, types, and documents, along with a query-building interface. It also lets you view Elasticsearch documents in a table-like structure, which can be quite handy for users coming from an RDBMS background.

Here is how the Elasticsearch Head home page looks when you open http://localhost:9200/_plugin/head.

The following image shows the home page of the Elasticsearch Head plugin:

The preceding screenshot gives you a quick insight into your cluster, such as the cluster health (Green, Yellow, or Red), how the shards are allocated to different nodes, which indices exist in the cluster, the size of each index, and so on. For example, in the preceding screenshot, we can see two indices: .marvel-2015.05.10 and eshadoop. You may be surprised that we never created an index with the name .marvel-2015.05.10. You can ignore this index for the time being; we will take a brief look at it in the next subsection.

Let's go back to our WordCount example. You can see that the document count for the eshadoop index in the preceding screenshot exactly matches the documents metric reported by the MapReduce job output that we saw in the last section.

The following diagram shows the Browser tab of the Elasticsearch Head plugin:

To take a look at the documents, navigate to the Browser tab; you will see a screen similar to the preceding screenshot. Click on the eshadoop index on the left-hand side under the Indices heading to see its documents. You can see that the output of the MapReduce job has been pushed directly to Elasticsearch. Furthermore, you can see the ES document fields, such as _index, _type, _id, and _score, along with the fields we are interested in: word and count. You may want to sort the results by clicking on the count column to see the most frequent words in the sample.txt file.

Using the Marvel dashboard

Marvel is a monitoring dashboard for real-time and historical analysis built on top of Kibana, the data visualization tool for Elasticsearch. The dashboard provides insight into the different metrics of the nodes, the JVM, and Elasticsearch internals. To open the Marvel dashboard, point your browser to http://localhost:9200/_plugin/marvel/.

The following screenshot gives you an overview of the Marvel dashboard:

You can see the different real-time metrics for your cluster, nodes, and indices. You can visualize the trends of the document count and the search and indexing request rates in a graphical way. This kind of visualization can help you gain quick insight into the usage pattern of an index and identify candidates for performance optimization. The dashboard also displays vital monitoring stats, such as the CPU usage, the load, the JVM memory usage, the free disk space, and so on. You can also filter by time range in the top-right corner to use the dashboard for historical analysis. Marvel stores this historical data in separate daily rolling indices with a name pattern such as .marvel-XXX.
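
If you want to see these daily Marvel indices from the command line, the _cat API that ships with Elasticsearch 1.x can list them (a minimal sketch):

$ curl -s 'http://localhost:9200/_cat/indices/.marvel-*?v'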

Exploring the data in Sense

Sense is a plugin embedded in Marvel that provides a seamless and easy-to-use REST API client for the Elasticsearch server. It is Elasticsearch-aware and frees you from memorizing the Elasticsearch query syntax by providing autosuggestions. It also helps by flagging typos and syntax errors.

To open the Sense user interface, open http://localhost:9200/_plugin/marvel/sense/index.html in your browser.

The following screenshot shows the query interface of Sense:

Now, let's look at the documents imported into the eshadoop index by executing a match_all query.

Use the following query in the query panel on the left-hand side of the Sense interface:

GET eshadoop/_search
{
   "query": {
        "match_all":{}
    }
}

Finally, click on the Send request button to execute the query and obtain the results.
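
If you prefer the command line to Sense, the same search can be issued with curl (a minimal sketch):

$ curl -s -XGET 'http://localhost:9200/eshadoop/_search?pretty' -d '
{
   "query": {
        "match_all": {}
    }
}'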

Note

You can point to different Elasticsearch servers if you wish by changing the server field at the top.

 

Summary


In this chapter, we started by setting up the prerequisites for installing Hadoop and then configured Hadoop in the pseudo-distributed mode. Next, we got the Elasticsearch server up and running and understood its basic configuration. We learned how to install the Elasticsearch plugins. We imported the sample file for the WordCount example into HDFS and successfully ran our first Hadoop MapReduce job that uses ES-Hadoop to get the data into Elasticsearch. Then we learned how to use the Head and Marvel plugins to explore documents in Elasticsearch.

With our environment and the required tools set up, and a basic understanding in place, we are all set for hands-on experience of writing MapReduce jobs that use ES-Hadoop. In the next chapter, we will take a look at how the WordCount job is developed. We will also develop a couple of jobs for real-world scenarios that write and read data to and from HDFS and Elasticsearch.

About the Author
  • Vishal Shukla

    Vishal Shukla is the CEO of Brevitaz Systems (http://brevitaz.com) and a technology evangelist at heart. He is a passionate software scientist and a big data expert. Vishal has extensive experience in designing modular enterprise systems. Since his college days, more than 11 years ago, Vishal has enjoyed coding in JVM-based languages. He also embraces design thinking and sustainable software development. He has vast experience in architecting enterprise systems in various domains. Vishal is deeply interested in technologies related to big data engineering, analytics, and machine learning. He set up Brevitaz Systems, which delivers massively scalable and sustainable big data and analytics-based enterprise applications to its global clientele. With varied expertise in big data technologies and architectural acumen, the Brevitaz team has successfully developed and re-engineered a number of legacy systems into state-of-the-art scalable systems. Brevitaz has embedded agile practices, such as scrum, test-driven development, continuous integration, and continuous delivery, into its culture to deliver high-quality products to its clients. Vishal is a music and art lover. He loves to sing, play musical instruments, draw portraits, and play sports such as cricket, table tennis, and pool in his free time. You can contact Vishal at vishal.shukla@brevitaz.com and on LinkedIn at https://in.linkedin.com/in/vishalshu. You can also follow Vishal on Twitter at @vishal1shukla2.
