The goal of this book is to get you up and running with ES-Hadoop and enable you to solve real-world analytics problems. To take the first step towards doing this, we will start with setting up Hadoop, Elasticsearch, and related toolsets, which you will use throughout the rest of the book.
We encourage you to try the examples in the book (along with reading) to speed up the learning process.
In this chapter, we will cover the following topics:
Setting up Hadoop in pseudo-distributed mode
Setting up Elasticsearch and its related plugins
Running the WordCount example
Exploring data in Marvel and Head
For our exploration of Hadoop and Elasticsearch, we will use an Ubuntu-based host. However, you may opt to run any other Linux OS and set up Hadoop and Elasticsearch there.
If you are already a Hadoop user and have Hadoop set up on your local machine, you may jump directly to the section Setting up Elasticsearch.
Hadoop supports three cluster modes: the stand-alone mode, the pseudo-distributed mode, and the fully-distributed mode. The pseudo-distributed mode on a Linux operating system is sufficient for walking through the examples in this book. This mode lets us mirror the components, without getting into the complexity of setting up many nodes, in such a way that they behave no differently from a real production environment. In the pseudo-distributed mode, each component runs in its own JVM process.
The examples in this book are developed and tested against Oracle Java 1.8. These examples should run fine with other distributions of Java 8 as well.
In order to set up Oracle Java 8, open the terminal and execute the following steps:
First, add and update the repository for Java 8 with the following command:
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
Next, install Java 8 and configure the environment variables, as shown in the following command:
$ sudo apt-get install oracle-java8-set-default
Now, verify the installation as follows:
$ java -version
This should show an output similar to the following code; it may vary a bit based on the exact version:
java version "1.8.0_60" Java(TM) SE Runtime Environment (build 1.8.0_60-b27) Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
To ensure that our ES-Hadoop environment is clean and isolated from the rest of the applications and to be able to manage security and permissions easily, we will set up a dedicated user. Perform the following steps:
First, add the hadoop group with the following command:
$ sudo addgroup hadoop
Then, add the eshadoop user to the hadoop group, as shown in the following command:
$ sudo adduser eshadoop hadoop
Finally, add the eshadoop user to the sudoers list by adding the user to the sudo group as follows:
$ sudo adduser eshadoop sudo
Now, you need to relogin with the eshadoop user to execute further steps.
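For example, you can switch to the new user directly from your current shell (a quick alternative to logging out and back in; the username is the one we just created):
$ su - eshadoop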
In order to manage nodes, Hadoop requires SSH access, so let's install and run SSH. Perform the following steps:
First, install ssh with the following command:
$ sudo apt-get install ssh
Then, generate a new SSH key pair using the ssh-keygen utility, by using the following command:
$ ssh-keygen -t rsa -P '' -C email@example.com
Now, confirm the key generation by issuing the following command. It should list at least a couple of files named id_rsa and id_rsa.pub. We just created an RSA key pair with an empty password so that Hadoop can interact with the nodes without the need to enter a passphrase:
$ ls -l ~/.ssh
To enable SSH access to your local machine, you need to register the newly generated public key as an authorized key for login, using the following command:
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Finally, do not forget to test the password-less ssh using the following command:
$ ssh localhost
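If the login still prompts for a password, the permissions on the .ssh directory are a common culprit. A minimal fix, assuming the default key locations used above:
$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys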
Next, download Hadoop and extract it to /usr/local so that it is available to other users as well. Perform the following steps:
First, download the Hadoop tarball by running the following command:
$ wget http://ftp.wayne.edu/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
Next, extract the tarball to the /usr/local directory with the following command:
$ sudo tar vxzf hadoop-2.6.0.tar.gz -C /usr/local
Now, rename the Hadoop directory using the following command:
$ cd /usr/local
$ sudo mv hadoop-2.6.0 hadoop
Finally, change the owner of all the files to the eshadoop user and the hadoop group with the following command:
$ sudo chown -R eshadoop:hadoop hadoop
The next step is to set up environment variables. You can do so by exporting the required variables to the .bashrc file for the user.
Open the .bashrc file using any editor of your choice, then add the following export declarations to set up our environment variables:
#Set JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-8-oracle

#Set Hadoop related environment variable
export HADOOP_INSTALL=/usr/local/hadoop

#Add bin and sbin directory to PATH
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin

#Set few more Hadoop related environment variable
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
Once you have saved the .bashrc file, you can relogin to have your new environment variables visible, or you can source the .bashrc file using the following command:
$ source ~/.bashrc
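As a quick sanity check (optional, assuming the variables exported above), you can confirm that the new variables are visible and that the Hadoop binaries are on your PATH:
$ echo $HADOOP_INSTALL
$ which hadoop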
Now, we need to set up the JAVA_HOME environment variable in the hadoop-env.sh file that is used by Hadoop. You can find it in $HADOOP_INSTALL/etc/hadoop.
Next, change the JAVA_HOME path to reflect your Java installation directory. On my machine, it looks similar to the following:
$ export JAVA_HOME=/usr/lib/jvm/java-8-oracle
Now, let's relogin and confirm the configuration using the following command:
$ hadoop version
As you know, we will set up our Hadoop environment in a pseudo-distributed mode. In this mode, each Hadoop daemon runs in a separate Java process. The next step is to configure these daemons. So, let's switch to the following folder that contains all the Hadoop configuration files:
$ cd $HADOOP_INSTALL/etc/hadoop
The configuration of core-site.xml will set up the temporary directory for Hadoop and the default filesystem. In our case, the default filesystem refers to the NameNode. Let's change the content of the <configuration> section of core-site.xml so that it looks similar to the following code:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/eshadoop/hdfs/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
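Note that hadoop.tmp.dir points to a directory that may not exist yet. Assuming the path configured above, you can create it up front so that the Hadoop daemons do not fail on startup:
$ mkdir -p /home/eshadoop/hdfs/tmp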
Now, we will configure the replication factor for HDFS files. To set the replication to 1, change the content of the <configuration> section of hdfs-site.xml so that it looks similar to the following code:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Note
We will run Hadoop in the pseudo-distributed mode. In order to do this, we need to configure the YARN resource manager. YARN handles the resource management and scheduling responsibilities in the Hadoop cluster so that the data processing and data storage components can focus on their respective tasks.
Configure yarn-site.xml to set the auxiliary service name and class, as shown in the following code:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
Hadoop provides mapred-site.xml.template, which you can rename to mapred-site.xml and change the content of the <configuration> section to the following code; this will ensure that the MapReduce jobs run on YARN as opposed to running them in-process locally:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
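For reference, the rename mentioned above can be done from the configuration directory before editing the file; a simple sketch, assuming you are still in $HADOOP_INSTALL/etc/hadoop:
$ cp mapred-site.xml.template mapred-site.xml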
We have now configured all the Hadoop daemons, covering HDFS, YARN, and MapReduce. You may already be aware that HDFS relies on the NameNode and DataNodes. The NameNode holds the storage-related metadata, whereas the DataNodes store the actual data in the form of blocks. When you set up your Hadoop cluster, you need to format the NameNode before you can start using HDFS. We can do so with the following command:
$ hadoop namenode -format
Note
If you are already using the DataNodes of an existing HDFS setup, do not format the NameNode unless you know what you are doing. When you format the NameNode, you lose all the storage metadata that describes how the blocks are distributed among the DataNodes. This means that although you didn't physically remove the data from the DataNodes, that data becomes inaccessible to you. Therefore, it is always good to clear the data on the DataNodes whenever you format the NameNode.
Now, we have all the prerequisites set up and all the Hadoop daemons configured. In order to run our first MapReduce job, we need the required Hadoop daemons to be running.
Let's start with HDFS using the following command. This command starts the NameNode, SecondaryNameNode, and DataNode daemons:
$ start-dfs.sh
The next step is to start the YARN resource manager using the following command (YARN will start the ResourceManager and NodeManager daemons):
$ start-yarn.sh
If the preceding two commands were successful in starting HDFS and YARN, you should be able to check the running daemons using the jps tool (this tool lists the running JVM processes on your machine):
$ jps
If everything worked successfully, you should see the following services running:
13386 SecondaryNameNode
13059 NameNode
13179 DataNode
17490 Jps
13649 NodeManager
13528 ResourceManager
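When you are done working, you can stop the daemons in the reverse order using the companion scripts that ship in the same sbin directory:
$ stop-yarn.sh
$ stop-dfs.sh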
In this section, we will download and configure the Elasticsearch server and install the Elasticsearch Head and Marvel plugins.
To download Elasticsearch, perform the following steps:
First, download Elasticsearch using the following command:
$ wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.7.1.tar.gz
Once the file is downloaded, extract it to /usr/local and rename it with a convenient name, using the following commands:
$ sudo tar -xvzf elasticsearch-1.7.1.tar.gz -C /usr/local
$ sudo mv /usr/local/elasticsearch-1.7.1 /usr/local/elasticsearch
Then, set the eshadoop user as the owner of the directory as follows:
$ sudo chown -R eshadoop:hadoop /usr/local/elasticsearch
The Elasticsearch configuration file, elasticsearch.yml, can be located in the config folder under the Elasticsearch home directory. Open the elasticsearch.yml file in the editor of your choice by using the following command:
$ cd /usr/local/elasticsearch
$ vi config/elasticsearch.yml
Uncomment the line with the cluster.name key in the elasticsearch.yml file and change the cluster name, as shown in the following code:
cluster.name: eshadoopcluster
Similarly, uncomment the line with the node.name key and change the value as follows:
node.name: "ES Hadoop Node"
Note
Elasticsearch comes with a decent default configuration to let you start the nodes with zero additional configurations. In a production environment and even in a development environment, sometimes it may be desirable to tweak some configurations.
By default, Elasticsearch assigns the node a name randomly picked from a list of around 3,000 Marvel character names. The default cluster name assigned to the node is elasticsearch. With the default configuration, ES nodes on the same network with the same cluster name will synchronize their data with each other. This may be unwanted if each developer is looking for an isolated ES server setup. It's always good to specify cluster.name and node.name to avoid unwanted surprises.
You can also change the defaults for the configurations starting with path.*. To set up the directories that store the server data, locate the paths section, then uncomment and change the highlighted paths, as shown in the following code:
########################### paths #############################

# Path to directory containing configuration (this file and logging.yml):
path.conf: /usr/local/elasticsearch/config

# Path to directory where to store index data allocated for this node.
#
# Can optionally include more than one location, causing data to be striped across
# the locations (a la RAID 0) on a file level, favouring locations with most free
# space on creation.
path.data: /usr/local/elasticsearch/data

# Path to temporary files:
path.work: /usr/local/elasticsearch/work

# Path to log files:
path.logs: /usr/local/elasticsearch/logs
Elasticsearch provides a plugin utility to install the Elasticsearch plugins. Execute the following command to install the Head plugin:
$ bin/plugin -install mobz/elasticsearch-head
-> Installing mobz/elasticsearch-head...
Trying https://github.com/mobz/elasticsearch-head/archive/master.zip...
Downloading ..........................................................DONE
Installed mobz/elasticsearch-head into /usr/local/elasticsearch/plugins/head
Identified as a _site plugin, moving to _site structure ...
As indicated by the console output, the plugin is successfully installed in the default plugins directory under the Elasticsearch home. You can access the Head plugin at http://localhost:9200/_plugin/head/.
Now, let's install the Marvel plugin using a similar command:
$ bin/plugin -i elasticsearch/marvel/latest
-> Installing elasticsearch/marvel/latest...
Trying http://download.elasticsearch.org/elasticsearch/marvel/marvel-latest.zip...
Downloading ..........................................................DONE
Installed elasticsearch/marvel/latest into /usr/local/elasticsearch/plugins/marvel
Finally, start Elasticsearch using the following command:
$ ./bin/elasticsearch
We will then get the following log:
[2015-05-13 21:59:37,344][INFO ][node           ] [ES Hadoop Node] version[1.5.1], pid[3822], build[5e38401/2015-04-09T13:41:35Z]
[2015-05-13 21:59:37,346][INFO ][node           ] [ES Hadoop Node] initializing ...
[2015-05-13 21:59:37,358][INFO ][plugins        ] [ES Hadoop Node] loaded [marvel], sites [marvel, head]
[2015-05-13 21:59:39,956][INFO ][node           ] [ES Hadoop Node] initialized
[2015-05-13 21:59:39,959][INFO ][node           ] [ES Hadoop Node] starting ...
[2015-05-13 21:59:40,133][INFO ][transport      ] [ES Hadoop Node] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.0.2.15:9300]}
[2015-05-13 21:59:40,159][INFO ][discovery      ] [ES Hadoop Node] eshadoopcluster/_bzqXWbLSXKXWpafHaLyRA
[2015-05-13 21:59:43,941][INFO ][cluster.service] [ES Hadoop Node] new_master [ES Hadoop Node][_bzqXWbLSXKXWpafHaLyRA][eshadoop][inet[/10.0.2.15:9300]], reason: zen-disco-join (elected_as_master)
[2015-05-13 21:59:43,989][INFO ][http           ] [ES Hadoop Node] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.0.2.15:9200]}
[2015-05-13 21:59:43,989][INFO ][node           ] [ES Hadoop Node] started
[2015-05-13 21:59:44,026][INFO ][gateway        ] [ES Hadoop Node] recovered [0] indices into cluster_state
[2015-05-13 22:00:00,707][INFO ][cluster.metadata] [ES Hadoop Node] [.marvel-2015.05.13] creating index, cause [auto(bulk api)], templates [marvel], shards [1]/[1], mappings [indices_stats, cluster_stats, node_stats, shard_event, node_event, index_event, index_stats, _default_, cluster_state, cluster_event, routing_event]
[2015-05-13 22:00:01,421][INFO ][cluster.metadata] [ES Hadoop Node] [.marvel-2015.05.13] update_mapping [node_stats] (dynamic)
The startup logs give you some useful hints as to what is going on. By default, Elasticsearch uses the ports from 9200 to 9299 for HTTP, allocating the first port that is available to the node. In the highlighted output, you can also see that it binds to the port 9300. Elasticsearch uses the port range from 9300 to 9399 for internal node-to-node communication and for communication using the Java client. It can use the zen multicast or unicast ping discovery to find other nodes in the cluster, with multicast as the default. We will learn more about these discovery mechanisms in later chapters.
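To confirm that the node is reachable over HTTP, you can hit the root endpoint from another terminal (a quick, optional check against the ports discussed above); it returns the node and cluster names along with the version information:
$ curl http://localhost:9200/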
Now that we have our ES-Hadoop environment tested and running, we are all set to run our first WordCount example. In the Hadoop world, WordCount has taken the place of the HelloWorld program, hasn't it?
You can download the examples in the book from https://github.com/vishalbrevitaz/eshadoop/tree/master/ch01. Once you have got the source code, you can build the JAR file for this chapter using the steps mentioned in the readme file in the source code zip. The build process should generate a ch01-0.0.1-job.jar file under the <SOURCE_CODE_BASE_DIR>/ch01/target directory.
For our WordCount example, you can use any text file of your choice. To explain the example, we will use the sample.txt file that is part of the source zip. Perform the following steps:
First, let's create a nice directory structure in HDFS to manage our input files with the following command:
$ hadoop fs -mkdir /input
$ hadoop fs -mkdir /input/ch01
Next, upload the sample.txt file to HDFS at the desired location, by using the following command:
$ hadoop fs -put data/ch01/sample.txt /input/ch01/sample.txt
Now, verify that the file is successfully imported to HDFS by using the following command:
$ hadoop fs -ls /input/ch01
Finally, when you execute the preceding command, it should show an output similar to the following code:
Found 1 items
-rw-r--r--   1 eshadoop supergroup       2803 2015-05-10 15:18 /input/ch01/sample.txt
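Optionally, you can also peek at the beginning of the file directly from HDFS to confirm that its contents made it across (the path is the one used above):
$ hadoop fs -cat /input/ch01/sample.txt | head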
We now have the job JAR file ready and the sample file imported to HDFS. Point your terminal to the <SOURCE_CODE_BASE_DIR>/ch01/target directory and run the following command:
$ hadoop jar ch01-0.0.1-job.jar /input/ch01/sample.txt
Now you'll get the following output:
15/05/10 15:21:33 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/05/10 15:21:34 WARN mr.EsOutputFormat: Speculative execution enabled for reducer - consider disabling it to prevent data corruption
15/05/10 15:21:34 INFO util.Version: Elasticsearch Hadoop v2.0.2 [ca81ff6732]
15/05/10 15:21:34 INFO mr.EsOutputFormat: Writing to [eshadoop/wordcount]
15/05/10 15:21:35 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/05/10 15:21:41 INFO input.FileInputFormat: Total input paths to process : 1
15/05/10 15:21:42 INFO mapreduce.JobSubmitter: number of splits:1
15/05/10 15:21:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1431251282365_0002
15/05/10 15:21:42 INFO impl.YarnClientImpl: Submitted application application_1431251282365_0002
15/05/10 15:21:42 INFO mapreduce.Job: The url to track the job: http://eshadoop:8088/proxy/application_1431251282365_0002/
15/05/10 15:21:42 INFO mapreduce.Job: Running job: job_1431251282365_0002
15/05/10 15:21:54 INFO mapreduce.Job: Job job_1431251282365_0002 running in uber mode : false
15/05/10 15:21:54 INFO mapreduce.Job:  map 0% reduce 0%
15/05/10 15:22:01 INFO mapreduce.Job:  map 100% reduce 0%
15/05/10 15:22:09 INFO mapreduce.Job:  map 100% reduce 100%
15/05/10 15:22:10 INFO mapreduce.Job: Job job_1431251282365_0002 completed successfully
…
…
…
  Elasticsearch Hadoop Counters
    Bulk Retries=0
    Bulk Retries Total Time(ms)=0
    Bulk Total=1
    Bulk Total Time(ms)=48
    Bytes Accepted=9655
    Bytes Received=4000
    Bytes Retried=0
    Bytes Sent=9655
    Documents Accepted=232
    Documents Received=0
    Documents Retried=0
    Documents Sent=232
    Network Retries=0
    Network Total Time(ms)=84
    Node Retries=0
    Scroll Total=0
    Scroll Total Time(ms)=0
Tip
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
We just executed our first Hadoop MapReduce job that uses ES-Hadoop to import data into Elasticsearch. This MapReduce job simply outputs the count of each word in the Mapper phase, and the Reducer calculates the sum of all the counts for each word. We will dig into the details of how exactly this WordCount program is developed in the next chapter. The console output of the job displays useful log information to indicate the progress of the job execution. It also displays the ES-Hadoop counters, which provide some handy information about the amount of data and the number of documents being sent and received, the number of retries, the time taken, and so on. If you have used the sample.txt file provided in the source zip, you will be able to see that the job found 232 unique words, and all of them are pushed as Elasticsearch documents. In the next section, we will examine these documents with the Elasticsearch Head and Marvel plugins that we already installed in Elasticsearch. Note that you can also track the status of your ES-Hadoop MapReduce jobs, just like any other Hadoop job, in the YARN ResourceManager web UI, which replaces the JobTracker UI of Hadoop 1.x. In our setup, you can access it at http://localhost:8088/cluster.
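Before moving on to the plugins, you can also verify the document count from the command line; a quick check, assuming the eshadoop index and wordcount type shown in the job output above:
$ curl 'http://localhost:9200/eshadoop/wordcount/_count?pretty'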
In the previous section, we set up a couple of plugins: Head and Marvel. In this section, we will have a bird's eye view of how to use these plugins to explore the Elasticsearch documents that we just imported by running the ES-Hadoop MapReduce job.
The Elasticsearch Head plugin provides a simple web frontend to visualize the Elasticsearch indices, cluster and node health, and statistics. It provides an easy-to-use interface to explore indices, types, and documents with a query building interface. It also allows you to view Elasticsearch documents in a table-like structure, which can be quite handy for users coming from an RDBMS background.
Here is how the Elasticsearch Head home page looks when you open http://localhost:9200/_plugin/head.
The following image shows the home page of the Elasticsearch Head plugin:

The preceding screenshot gives you a quick insight into your cluster, such as what the cluster health is (Green, Yellow, or Red), how the shards are allocated to different nodes, which indices exist in the cluster, what the size of each index is, and so on. For example, in the preceding screenshot, we can see two indices: .marvel-2015.05.10 and eshadoop. You may be surprised that we never created an index with the name .marvel-2015.05.10. You can ignore this index for the time being; we will take a brief look at it in the next subsection.
Let's go back to our WordCount example. You can see that the document count for the eshadoop index in the preceding screenshot exactly matches the number of documents indicated by the MapReduce job output that we saw in the last section.
The following diagram shows the Browser tab of the Elasticsearch Head plugin:

To take a look at the documents, navigate to the Browser tab. You can see that the screen is similar to the one shown in the preceding screenshot. Click on the eshadoop index on the left-hand side under the Indices heading to see the relevant documents. You can see that the output of the MapReduce job is pushed directly to Elasticsearch. Furthermore, you can see the ES document fields, such as _index, _type, _id, and _score, along with the fields that we are interested in: word and count. You may want to sort the results by clicking on the count column to see the most frequent words in the sample.txt file.
Marvel is a monitoring dashboard for real-time and historical analysis that is built on top of Kibana, a data visualization tool for Elasticsearch. This dashboard provides insight into the different metrics of the node, the JVM, and Elasticsearch internals. To open the Marvel dashboard, point your browser to http://localhost:9200/_plugin/marvel/.
The following screenshot gives you an overview of the Marvel dashboard:

You can see the different real-time metrics for your cluster, nodes, and indices. You can visualize the trends of the document count, search, and indexing request rates in a graphical way. This kind of visualization can be helpful to get a quick insight into the usage pattern of an index and to find candidates for performance optimization. The dashboard also displays vital monitoring stats, such as the CPU usage, the load, the JVM memory usage, the free disk space, and so on. You can also filter by time range in the top-right corner to use the dashboard for historical analysis. Marvel stores this historical data in a separate daily rolling index with a name pattern such as .marvel-XXX.
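If you are curious about how many of these daily Marvel indices exist and how much space they consume, the _cat API lists them (an optional check; the index pattern follows the naming convention above):
$ curl 'http://localhost:9200/_cat/indices/.marvel-*?v'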
Sense is a plugin embedded in Marvel that provides a seamless and easy-to-use REST API client for the Elasticsearch server. It is Elasticsearch-aware and frees you from memorizing the Elasticsearch query syntax by providing autosuggestions. It also helps by indicating typos or syntax errors.
To open the Sense user interface, open http://localhost:9200/_plugin/marvel/sense/index.html in your browser.
The following screenshot shows the query interface of Sense:

Now, let's find the documents imported into the eshadoop index by executing a match_all query.
Then, use the following query in the query panel on the left-hand side of the Sense interface:
GET eshadoop/_search
{
  "query": {
    "match_all": {}
  }
}
Finally, click on the Send request button to execute the query and obtain the results.
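If you prefer the terminal, the same query can be issued with curl; an equivalent request, not a replacement for Sense:
$ curl -XGET 'http://localhost:9200/eshadoop/_search?pretty' -d '{"query": {"match_all": {}}}'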
In this chapter, we started by installing the prerequisites for Hadoop and then configured Hadoop in the pseudo-distributed mode. We then got the Elasticsearch server up and running and understood its basic configuration. We learned how to install the Elasticsearch plugins. We imported the sample file for the WordCount example to HDFS and successfully ran our first Hadoop MapReduce job that uses ES-Hadoop to get the data into Elasticsearch. Then we learned how to use the Head and Marvel plugins to explore documents in Elasticsearch.
With our environment and the required tools set up, and a basic understanding in place, we are all set to get hands-on experience of writing MapReduce jobs that use ES-Hadoop. In the next chapter, we will take a look at how the WordCount job is developed. We will also develop a couple of jobs for real-world scenarios that write and read data to and from HDFS and Elasticsearch.