Nowadays, search is one of the primary features needed in almost every application, and it can be fulfilled by Elasticsearch, which also brings many other capabilities with it. Elasticsearch, which is built on top of Apache Lucene, is an open source, distributed, and highly scalable search engine. It provides extremely fast searches and makes data discovery easy.
In this chapter, we will cover the following topics:
Concepts and terminologies related to Elasticsearch
Rest API and the JSON data structure
Installing and configuring Elasticsearch
Installing the Elasticsearch plugins
Basic operations with Elasticsearch
Elasticsearch is a distributed, full-text search and analytics engine that is built on top of Lucene, a search engine library written in Java that is also the base for Solr. Since its first release in 2010, Elasticsearch has been widely adopted by both large and small organizations, including NASA, Wikipedia, and GitHub, for different use cases. The latest releases of Elasticsearch focus more on resiliency, which builds users' confidence in using Elasticsearch as a data storage tool, apart from using it as a full-text search engine. Elasticsearch ships with sensible default configurations and settings, and hides all the complexities from beginners, which lets everyone become productive very quickly by just learning the basics.
Lucene is a blazing fast search library, but it is tough to use directly and has very limited features for scaling beyond a single machine. Elasticsearch comes to the rescue to overcome these limitations of Lucene. Apart from providing a simple HTTP/JSON API, which enables language interoperability in contrast to Lucene's bare Java API, it has the following main features:
Distributed: Elasticsearch is distributed in nature from day one, and has been designed for scaling horizontally and not vertically. You can start with a single-node Elasticsearch cluster on your laptop and can scale that cluster to hundreds or thousands of nodes without worrying about the internal complexities that come with distributed computing, distributed document storage, and searches.
High Availability: Data replication means having multiple copies of data in your cluster. This feature enables users to create highly available clusters by keeping more than one copy of the data. You just need to issue a simple command, and it automatically creates redundant copies of the data to provide higher availability and avoid data loss in the case of machine failure.
REST-based: Elasticsearch is based on REST architecture and provides API endpoints not only to perform CRUD operations over HTTP, but also to let users perform cluster monitoring tasks using REST APIs. REST endpoints also enable users to change cluster and index settings dynamically, rather than manually pushing configuration updates to all the nodes in a cluster by editing the elasticsearch.yml file and restarting the node. This is possible because each resource (index, document, node, and so on) in Elasticsearch is accessible via a simple URI.
Powerful Query DSL: The Query DSL (domain-specific language) is a JSON interface provided by Elasticsearch that exposes the power of Lucene for writing and reading queries in a very easy way. Thanks to the Query DSL, developers who are not familiar with Lucene query syntax can also write complex queries in Elasticsearch.
Schemaless: Being schemaless means that you do not have to create a schema with field names and data types before indexing data in Elasticsearch. Though it is one of the most misunderstood concepts, this is one of the biggest advantages we have seen in many organizations, especially in the e-commerce sector, where it is sometimes difficult to define the schema in advance. When you send your first document to Elasticsearch, it tries its best to parse every field in the document and creates the schema itself. If you later send another document with a different data type for the same field, Elasticsearch will reject that document with a mapping error. So, Elasticsearch is not completely schemaless, but its dynamic behavior of creating a schema is very useful.
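The dynamic-mapping behavior described above can be sketched with a toy type-inference routine. This is only an illustration of the idea, not Elasticsearch's actual implementation; the function names are made up for this example:

```python
def infer_type(value):
    # Map a value to a simple field type, mimicking dynamic mapping.
    # bool must be checked before int, since bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    return "string"

def index_document(mapping, doc):
    # Infer a type for each new field; reject the document on a type conflict.
    for field, value in doc.items():
        inferred = infer_type(value)
        if field not in mapping:
            mapping[field] = inferred
        elif mapping[field] != inferred:
            raise ValueError(
                "mapping conflict: field %r is %s, got %s"
                % (field, mapping[field], inferred)
            )
    return mapping

mapping = {}
index_document(mapping, {"title": "Elasticsearch", "price": 39})
# A later document with a string in the numeric field is rejected.
try:
    index_document(mapping, {"price": "thirty-nine"})
except ValueError as e:
    print("rejected:", e)
```

The first document creates the schema on the fly; the second one conflicts with it and is rejected, which mirrors the "not completely schemaless" behavior described above.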
Elasticsearch is based on a REST design pattern and all the operations, for example, document insertion, deletion, updating, searching, and various monitoring and management tasks, can be performed using the REST endpoints provided by Elasticsearch.
In a REST-based web API, data and services are exposed as resources with URLs. All requests are routed to a resource represented by a path. Each resource has a resource identifier, called a URI. All the potential actions on a resource can be performed using the simple request types provided by the HTTP protocol. The following examples describe how CRUD operations are done with a REST API:
To create the user, use the following:
POST /user fname=Bharvi&lname=Dixit&age=28&id=123
The following command is used for retrieval:
GET /user/123
Use the following to update the user information:
PUT /user/123 fname=Lavleen
To delete the user, use this:
DELETE /user/123
Note
Many Elasticsearch users get confused between the POST and PUT request types. The difference is simple: POST is used to create a new resource, while PUT is used to update an existing resource. The PUT request is used during resource creation in some cases, but it must have the complete URI available for this.
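The PUT-versus-POST distinction can be illustrated with a toy in-memory resource store. This is a sketch of the REST semantics only, with made-up class and method names, not real HTTP handling:

```python
import uuid

class ResourceStore:
    """Toy store illustrating REST semantics: POST creates under a
    collection URI, PUT writes to a complete resource URI."""

    def __init__(self):
        self.resources = {}

    def post(self, collection, data):
        # POST: the server picks the identifier.
        rid = uuid.uuid4().hex
        self.resources[(collection, rid)] = data
        return rid

    def put(self, collection, rid, data):
        # PUT: the client supplies the full URI, creating or replacing.
        created = (collection, rid) not in self.resources
        self.resources[(collection, rid)] = data
        return created

store = ResourceStore()
auto_id = store.post("user", {"fname": "Bharvi", "lname": "Dixit"})
created = store.put("user", "123", {"fname": "Lavleen"})
print(created)  # True on the first PUT to /user/123, False on a repeat
```

The key point mirrored here: a PUT only works when the complete resource URI (collection plus id) is known, while a POST lets the server generate the identifier.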
All the real-world data comes in object form. Every entity (object) has some properties. These properties can be in the form of simple key value pairs or they can be in the form of complex data structures. One property can have properties nested into it, and so on.
Elasticsearch is a document-oriented data store where objects, called documents, are stored and retrieved in the form of JSON. These objects are not only stored; the content of these documents is also indexed to make them searchable.
JavaScript Object Notation (JSON) is a lightweight data interchange format and, in the NoSQL world, it has become the standard data serialization format. The primary reasons behind using it as a standard format are its language independence and its support for complex nested data structures. JSON supports the following data types:
Array, Boolean, Null, Number, Object, and String
The following is an example of a JSON object, which is self-explanatory about how these data types are stored in key value pairs:
{
  "int_array": [1, 2, 3],
  "string_array": ["Lucene", "Elasticsearch", "NoSQL"],
  "boolean": true,
  "null": null,
  "number": 123,
  "object": {
    "a": "b",
    "c": "d",
    "e": "f"
  },
  "string": "Learning Elasticsearch"
}
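The document above parses cleanly with any JSON library, and each JSON type maps naturally onto a native type of the host language. A quick check in Python:

```python
import json

raw = '''
{
  "int_array": [1, 2, 3],
  "string_array": ["Lucene", "Elasticsearch", "NoSQL"],
  "boolean": true,
  "null": null,
  "number": 123,
  "object": {"a": "b", "c": "d", "e": "f"},
  "string": "Learning Elasticsearch"
}
'''

doc = json.loads(raw)
# JSON arrays become lists, objects become dicts, null becomes None.
print(type(doc["int_array"]).__name__)   # list
print(doc["null"] is None)               # True
print(doc["object"]["a"])                # b
```

This language independence is exactly what makes JSON a convenient interchange format between Elasticsearch and clients written in Python, Java, or any other language.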
The following are the most common terms that are very important to know when starting with Elasticsearch:
Node: A single instance of Elasticsearch running on a machine.
Cluster: A cluster is a collection of one or more nodes/instances of Elasticsearch that are connected to each other under a single cluster name.
Document: A document is a JSON object that contains the actual data in key value pairs.
Index: A logical namespace under which Elasticsearch stores data, and may be built with more than one Lucene index using shards and replicas.
Doc types: A doc type in Elasticsearch represents a class of similar documents. A type consists of a name, such as a user or a blog post, and a mapping, including data types and the Lucene configurations for each field. (An index can contain more than one type.)
Shard: A shard is a container of data, composed of Lucene segments. An index is divided into one or more shards, which can live on a single node or be spread across multiple nodes, to make the data distributable.
Note
A shard can be either primary or secondary. A primary shard is the one to which all the operations that change the index are directed. A secondary shard is the one that contains a duplicate of the primary shard's data; it helps in searching the data quickly as well as in providing high availability. If the machine that holds the primary shard goes down, the secondary shard becomes the primary automatically.
Replica: A duplicate copy of the data living in a shard for high availability.
Elasticsearch is first and foremost a search engine but, because of its rich functionality, organizations have started using it as a NoSQL data store as well. However, it was not made for maintaining the complex relationships that are offered by traditional relational databases.
If you want to understand Elasticsearch in relational database terms then, as shown in the following image, an index in Elasticsearch is similar to a database that consists of multiple types. A single row is represented as a document, and columns are similar to fields.

Elasticsearch does not have the concept of referential integrity constraints such as foreign keys. But, despite being a search engine and NoSQL data store, it does allow us to maintain some relationships among different documents, which will be discussed in the upcoming chapters.
With these theoretical concepts covered, we are ready to move on to the practical steps with Elasticsearch.
First of all, you need to be aware of the basic requirements to install and run Elasticsearch, which are listed as follows:
Java (Oracle Java 1.7u55 and above)
RAM: Minimum 2 GB
Root permission to install and configure program libraries
Note
Please go through the following URL to check the JVM and OS dependencies of Elasticsearch: https://www.elastic.co/subscriptions/matrix.
The most common error that comes up when you are using an incompatible Java version with Elasticsearch is the following:
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/elasticsearch/bootstrap/Elasticsearch : Unsupported major.minor version 51.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637)
at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
If you see the preceding error while installing or working with Elasticsearch, it is most probably because you have an incompatible version of Java set in the JAVA_HOME variable, or the variable is not set at all. Many users install the latest version of Java but forget to point JAVA_HOME to the new installation. If this variable is not set, then Elasticsearch looks in the following directories to find Java, and the first existing directory is used:
/usr/lib/jvm/jdk-7-oracle-x64, /usr/lib/jvm/java-7-oracle, /usr/lib/jvm/java-7-openjdk, /usr/lib/jvm/java-7-openjdk-amd64/, /usr/lib/jvm/java-7-openjdk-armhf, /usr/lib/jvm/java-7-openjdk-i386/, /usr/lib/jvm/default-java
I have used Elasticsearch Version 2.0.0 in this book; you can choose to install other versions if you wish. You just need to replace 2.0.0 with your chosen version number. You need to have an administrative account to perform the installations and configurations.
Let's get started with installing Elasticsearch on Ubuntu Linux. The steps will be the same for all Ubuntu versions:
Download the Elasticsearch Version 2.0.0 Debian package:
wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-2.0.0.deb
Install Elasticsearch, as follows:
sudo dpkg -i elasticsearch-2.0.0.deb
To run Elasticsearch as a service (to ensure Elasticsearch starts automatically when the system is booted), do the following:
sudo update-rc.d elasticsearch defaults 95 10
Follow these steps to install Elasticsearch on CentOS machines. If you are using any other Red Hat Linux distribution, you can use the same commands, as follows:
Download the Elasticsearch Version 2.0.0 RPM package:
wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-2.0.0.rpm
Install Elasticsearch, using this command:
sudo rpm -i elasticsearch-2.0.0.rpm
To run Elasticsearch as a service (to ensure Elasticsearch starts automatically when the system is booted), use the following:
sudo systemctl daemon-reload
sudo systemctl enable elasticsearch.service
The following table shows the directory layout of Elasticsearch that is created after installation. These directories have some minor differences in paths depending upon the Linux distribution you are using.
| Description | Path on Debian/Ubuntu | Path on RHEL/CentOS |
|---|---|---|
| Elasticsearch home directory | | |
| Elasticsearch and Lucene jar files | | |
| Contains plugins | | |
| The location of the binary scripts used to start an ES node and download plugins | | |
| Contains the Elasticsearch configuration files | | |
| Contains the data files of the index/shard allocated on that node | | |
| The startup script for Elasticsearch (contains environment variables including heap size and file descriptors) | | |
| Contains the log files of Elasticsearch | | |
During installation, a user and a group named elasticsearch are created by default. Elasticsearch does not start automatically just after installation. It is prevented from starting automatically to avoid connecting to an already running node with the same cluster name.
Open the elasticsearch.yml file, which contains most of the Elasticsearch configuration options:
sudo vim /etc/elasticsearch/elasticsearch.yml
Now, edit the following ones:
cluster.name: The name of your cluster
node.name: The name of the node
path.data: The path where the data for the ES will be stored
After saving the elasticsearch.yml file, start Elasticsearch:
sudo service elasticsearch start
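The three edited settings might look like the following in elasticsearch.yml. The values here are illustrative; pick a cluster name, node name, and data path that suit your environment:

```yaml
cluster.name: my-es-cluster
node.name: node-1
path.data: /var/data/elasticsearch
```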
Elasticsearch will start on two ports, as follows:
9200: This is used to create HTTP connections
9300: This is used for TCP connections from a Java client and for node-to-node communication inside a cluster
Note
Do not forget to uncomment the lines you have edited. Please note that if you are using a new data path instead of the default one, then you first need to change the owner and the group of that data path to the elasticsearch user.
The command to change the directory ownership to elasticsearch is as follows:
sudo chown -R elasticsearch:elasticsearch data_directory_path
Run the following command to check whether Elasticsearch has been started properly:
sudo service elasticsearch status
If the output of the preceding command is shown as elasticsearch is not running, then there must be some configuration issue. You can open the log file and see what is causing the error.
The list of possible issues that might prevent Elasticsearch from starting is:
A Java issue, as discussed previously
Indentation issues in the elasticsearch.yml file
At least 1 GB of RAM is not free to be used by Elasticsearch
The ownership of the data directory path has not been changed to the elasticsearch user
Something is already running on port 9200 or 9300
Adding another node to a cluster is very simple. You just need to follow all the installation steps on another system to set up a new instance of Elasticsearch. However, keep the following in mind:
In the elasticsearch.yml file, cluster.name is set to the same value on both nodes
Both the systems should be reachable from each other over the network
There is no firewall rule that blocks the Elasticsearch ports
The Elasticsearch and Java versions are the same on both nodes
You can optionally set the network.host parameter to the IP address of the system to which you want Elasticsearch to bind and on which the other nodes will communicate with it.
Plugins provide extra functionalities in a customized manner. They can be used for querying, monitoring, and management tasks. Thanks to the wide Elasticsearch community, there are several easy-to-use plugins available. In this book, I will be discussing some of them.
The Elasticsearch plugins come in two flavors:
Site plugins: These are plugins that have a site (web app) in them and do not contain any Java-related content. After installation, they are moved to the site directory and can be accessed using es_ip:port/_plugin/plugin_name.
Java plugins: These mainly contain .jar files and are used to extend the functionalities of Elasticsearch. For example, the Carrot2 plugin is used for text-clustering purposes.
Elasticsearch ships with a plugin script that is located in the /usr/share/elasticsearch/bin directory, and any plugin can be installed using this script in the following format:
bin/plugin install plugin_url
Note
Once the plugin is installed, you need to restart that node to make it active. In the following image, you can see the different plugins installed inside the Elasticsearch node. Plugins need to be installed separately on each node of the cluster.
The following is the layout of the plugin directory of Elasticsearch:

You can check the log of your node that shows the following line at start up time:
[2015-09-06 14:16:02,606][INFO ][plugins ] [Matt Murdock] loaded [clustering-carrot2, marvel], sites [marvel, carrot2, head]
Alternatively, you can use the following command:
curl -XGET 'localhost:9200/_nodes/plugins?pretty'
Another option is to use the following URL in your browser:
http://localhost:9200/_nodes/plugins
The Head plugin is a web front-end for the Elasticsearch cluster that is very easy to use. This plugin offers various features, such as graphical representations of shards and the cluster state, easy query creation, and downloading query-based data in the CSV format.
The following is the command to install the Head plugin:
sudo /usr/share/elasticsearch/bin/plugin install mobz/elasticsearch-head
Restart the Elasticsearch node with the following command to load the plugin:
sudo service elasticsearch restart
Once Elasticsearch is restarted, open the browser and type the following URL to access it through the Head plugin:
http://localhost:9200/_plugin/head
Note
More information about the Head plugin can be found here: https://github.com/mobz/elasticsearch-head
Sense is an awesome tool to query Elasticsearch. You can add it to your latest version of Chrome, Safari, or Firefox browsers as an extension.

Now that Elasticsearch is installed and running on your system, and you have also installed the plugins, you are good to go with creating your first index and performing some basic operations.
We have already seen how Elasticsearch stores data and provides REST APIs to perform operations. In the next few sections, we will perform some basic actions using the command-line tool cURL. Once you have grasped the basics, you will start programming and implementing these concepts using Python and Java in the upcoming chapters.
Note
When we create an index, Elasticsearch by default creates five shards and one replica for each shard (this means five primary and five replica shards). This setting can be controlled in the elasticsearch.yml file by changing the index.number_of_shards and index.number_of_replicas settings, or it can also be provided while creating the index.
Once the index is created, the number of shards can't be increased or decreased; however, you can increase or decrease the number of replicas at any time after index creation. So it is better to choose the number of required shards for an index at the time of index creation.
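The reason the shard count is fixed is the routing formula Elasticsearch uses to place documents: shard = hash(routing) % number_of_primary_shards, where the routing value defaults to the document's _id. The following sketch uses CRC32 as a hypothetical stand-in for Elasticsearch's actual hash function:

```python
import zlib

def shard_for(doc_id, number_of_primary_shards):
    # Elasticsearch computes hash(routing) % number_of_primary_shards;
    # crc32 stands in here for the real hash function.
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_primary_shards

# With 5 primary shards, every document id maps to a stable shard...
placement = {doc_id: shard_for(doc_id, 5) for doc_id in ["1", "2", "3"]}

# ...but changing the shard count would re-map existing documents to
# different shards, which is why the number of shards is fixed at
# index creation time.
remapped = {doc_id: shard_for(doc_id, 6) for doc_id in ["1", "2", "3"]}
print(placement, remapped)
```

If the divisor could change after documents were indexed, a GET by id would look on the wrong shard, so only the replica count (which does not affect routing) can be changed later.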
Let's begin by creating our first index and giving this index a name, which is books in this case. After executing the following command, an index with five shards and one replica will be created:
curl -XPUT 'localhost:9200/books/'
Similar to all databases, Elasticsearch has the concept of a unique identifier for each document, known as _id. This identifier is created in two ways: either you provide your own unique ID while indexing the data or, if you don't provide any ID, Elasticsearch creates a default ID for that document. The following are the examples:
curl -XPUT 'localhost:9200/books/elasticsearch/1' -d '{
  "name": "Elasticsearch Essentials",
  "author": "Bharvi Dixit",
  "tags": ["Data Analytics", "Text Search", "Elasticsearch"],
  "content": "Added with PUT request"
}'
On executing the above command, Elasticsearch will give the following response:
{"_index":"books","_type":"elasticsearch","_id":"1","_version":1,"created":true}
However, if you do not provide an id (which is 1 in our case), then you will get the following error:
No handler found for uri [/books/elasticsearch] and method [PUT]
The reason behind the preceding error is that we are using a PUT request to create a document, but Elasticsearch has no idea where to store this document (no existing URI for the document is available).
If you want the _id to be auto-generated, you have to use a POST request. For example:
curl -XPOST 'localhost:9200/books/elasticsearch' -d '{
  "name": "Elasticsearch Essentials",
  "author": "Bharvi Dixit",
  "tags": ["Data Analytics", "Text Search", "Elasticsearch"],
  "content": "Added with POST request"
}'
The response from the preceding request will be as follows:
{"_index":"books","_type":"elasticsearch","_id":"AU-ityC8xdEEi6V7cMV5","_version":1,"created":true}
If you open the localhost:9200/_plugin/head URL, you can perform all the CRUD operations using the Head plugin as well:

Some of the stats that you can see in the preceding image are these:
Node name: node-1
Index name: books
No. of primary shards: 5
No. of docs in the index: 2
No. of unassigned shards (replica shards): 5
Note
Cluster states in Elasticsearch
An Elasticsearch cluster can be in one of three states: GREEN, YELLOW, or RED. If all the shards, primaries as well as replicas, are assigned in the cluster, it will be in the GREEN state. If any replica shard is not assigned because of some problem, the cluster will be in the YELLOW state. If any primary shard is not assigned to a node, the cluster will be in the RED state. We will see more on these states in the upcoming chapters. Elasticsearch never assigns a primary shard and its replica to the same node.
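The state logic in that note can be sketched as a small function. This is purely illustrative, not the actual cluster-health implementation:

```python
def cluster_state(unassigned_primaries, unassigned_replicas):
    # RED: at least one primary shard is unassigned.
    if unassigned_primaries > 0:
        return "RED"
    # YELLOW: all primaries assigned, but some replicas are not.
    if unassigned_replicas > 0:
        return "YELLOW"
    # GREEN: every primary and replica shard is assigned.
    return "GREEN"

# A single-node cluster holding 5 primaries and 5 replicas stays YELLOW,
# because a replica is never assigned to the same node as its primary.
print(cluster_state(0, 5))  # YELLOW
```

This also explains the screenshot above: with only node-1 running, the five replica shards remain unassigned and the cluster reports YELLOW rather than GREEN.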
We have stored documents in Elasticsearch. Now we can fetch them using their unique IDs with a simple GET request.
We have already indexed our document. Now, we can get the document using its document identifier by executing the following command:
curl -XGET 'localhost:9200/books/elasticsearch/1?pretty'
The output of the preceding command is as follows:
{
  "_index" : "books",
  "_type" : "elasticsearch",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "name": "Elasticsearch Essentials",
    "author": "Bharvi Dixit",
    "tags": ["Data Analytics", "Text Search", "Elasticsearch"],
    "content": "Added with PUT request"
  }
}
As you can see, there is a _source field in the response. This is a special field reserved by Elasticsearch to store all the JSON data. There are options available to not store the data in this field, since it comes with an extra disk space requirement. However, this field also helps in many ways when returning data from ES, re-indexing data, or doing partial document updates. We will see more on this field in the next chapters.
If the document did not exist in the index, the found field would have been marked as false.
Sometimes you need only some of the fields to be returned instead of the complete document. For these scenarios, you can send the names of the fields to be returned in the _source parameter with the GET request:
curl -XGET 'localhost:9200/books/elasticsearch/1?_source=name,author'
The response of Elasticsearch will be as follows:
{
  "_index": "books",
  "_type": "elasticsearch",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "author": "Bharvi Dixit",
    "name": "Elasticsearch Essentials"
  }
}
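Source filtering can be thought of as a simple key projection over the stored _source object. The following sketch mimics on the client side what Elasticsearch does on the server before returning the response; the function name is made up for this example:

```python
def filter_source(source, fields):
    # Keep only the requested top-level fields of the _source object.
    return {k: v for k, v in source.items() if k in fields}

source = {
    "name": "Elasticsearch Essentials",
    "author": "Bharvi Dixit",
    "tags": ["Data Analytics", "Text Search", "Elasticsearch"],
    "content": "Added with PUT request",
}
print(filter_source(source, ["name", "author"]))
# {'name': 'Elasticsearch Essentials', 'author': 'Bharvi Dixit'}
```

Doing this server-side saves bandwidth when documents carry large fields (such as content here) that the caller does not need.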
It is possible to update documents in Elasticsearch, which can be done either completely or partially, but updates come with some limitations and costs. In the next sections, we will see how these operations can be performed and how things work behind the scenes.
To update a whole document, you can use a PUT/POST request similar to the one we used to create a new document:
curl -XPUT 'localhost:9200/books/elasticsearch/1' -d '{
  "name": "Elasticsearch Essentials",
  "author": "Bharvi Dixit",
  "tags": ["Data Analytics", "Text Search", "Elasticsearch"],
  "content": "Updated document",
  "publisher": "pact-pub"
}'
The response of Elasticsearch looks like this:
{"_index":"books","_type":"elasticsearch","_id":"1","_version":2,"created":false}
If you look at the response, it shows that _version is 2 and created is false, meaning the document was updated.
Instead of updating the whole document, we can use the _update API to do partial updates. As shown in the following example, we will add a new field, updated_time, to the document, for which a script parameter has been used. Elasticsearch uses Groovy scripting by default.
Note
Scripting is disabled by default in Elasticsearch, so to use a script, you need to enable it by adding the following parameter to your elasticsearch.yml file:
script.inline: on
curl -XPOST 'localhost:9200/books/elasticsearch/1/_update' -d '{
  "script": "ctx._source.updated_time= \"2015-09-09T00:00:00\""
}'
The response of the preceding request will be this:
{"_index":"books","_type":"elasticsearch","_id":"1","_version":3}
It shows that a new version has been created in Elasticsearch.
Elasticsearch stores data in indexes that are composed of Lucene segments. These segments are immutable in nature, meaning that, once created, they can't be changed. So, when we send an update request to Elasticsearch, it does the following things in the background:
Fetches the JSON data from the _source field for that document
Makes the changes in the _source field
Deletes the old document
Creates a new document
All these data re-indexing tasks can be done by the user; however, if you are using the _update API, it is done with only one request. The process is the same for a whole-document update as for a partial update. The benefit of a partial update is that all the operations are done within a single shard, which avoids network overhead.
To delete a document using its identifier, we need to use the DELETE
request:
curl -XDELETE 'localhost:9200/books/elasticsearch/1'
The following is the response of Elasticsearch:
{"found":true,"_index":"books","_type":"elasticsearch","_id":"1","_version":4}
If you are from a Lucene background, then you must know how segment merging is done and how new segments are created in the background with more documents getting indexed. Whenever we delete a document from Elasticsearch, it does not get deleted from the file system right away. Rather, Elasticsearch just marks that document as deleted, and when you index more data, segment merging is done. At the same time, the documents that are marked as deleted are indeed deleted based on a merge policy. This process is also applied while the document is updated.
The space occupied by deleted documents can also be reclaimed with the _optimize API by executing the following command:
curl -XPOST 'http://localhost:9200/_optimize?only_expunge_deletes=true'
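The mark-as-deleted behavior can be sketched with tombstones that a later merge purges. This is a toy model of segment merging, with made-up class names, not Lucene's actual data structures:

```python
class Segment:
    # An immutable batch of documents plus a set of deletion tombstones.
    def __init__(self, docs):
        self.docs = dict(docs)
        self.deleted = set()

def delete(segment, doc_id):
    # A delete only marks the document; its bytes stay on disk.
    segment.deleted.add(doc_id)

def merge(segments):
    # Merging rewrites only the live documents into one new segment,
    # finally reclaiming the space held by deleted ones.
    live = {}
    for seg in segments:
        for doc_id, doc in seg.docs.items():
            if doc_id not in seg.deleted:
                live[doc_id] = doc
    return Segment(live)

seg = Segment({"1": {"name": "A"}, "2": {"name": "B"}})
delete(seg, "1")                # document "1" is only marked as deleted
merged = merge([seg])           # the merge drops the tombstoned document
print(sorted(merged.docs))      # ['2']
```

Until the merge runs, the deleted document still occupies disk space, which is exactly why the _optimize call above can be used to expunge deletes explicitly.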
While developing applications, some scenarios require you to check whether a document exists in Elasticsearch or not. In these scenarios, rather than querying the documents with a GET request, you have the option of using another HTTP request method called HEAD:
curl -i -XHEAD 'localhost:9200/books/elasticsearch/1'
The following is the response of the preceding command:
HTTP/1.1 200 OK
Content-Type: text/plain; charset=UTF-8
Content-Length: 0
In the preceding command, I have used the -i parameter, which shows the header information of the HTTP response. It has been used because the HEAD request only returns headers and no content. If the document is found, the status code will be 200; if not, it will be 404.
A lot of things have been covered in this chapter. You have learned about the Elasticsearch architecture and how it works. Then, you learned how to install Elasticsearch and its plugins. Finally, basic operations with Elasticsearch were performed.
With all these, you are ready to learn about data analysis phases and mappings in the next chapter.