Elasticsearch Essentials

By Bharvi Dixit
  • Instant online access to over 7,500+ books and videos
  • Constantly updated with 100+ new titles each month
  • Breadth and depth in over 1,000+ technologies
  1. Getting Started with Elasticsearch

About this book

With constantly evolving and growing datasets, organizations have the need to find actionable insights for their business. ElasticSearch, which is the world's most advanced search and analytics engine, brings the ability to make massive amounts of data usable in a matter of milliseconds. It not only gives you the power to build blazing fast search solutions over a massive amount of data, but can also serve as a NoSQL data store.

This guide will take you on a tour to become a competent developer quickly with a solid knowledge level and understanding of the ElasticSearch core concepts. Starting from the beginning, this book will cover these core concepts, setting up ElasticSearch and various plugins, working with analyzers, and creating mappings. This book provides complete coverage of working with ElasticSearch using Python and performing CRUD operations and aggregation-based analytics, handling document relationships in the NoSQL world, working with geospatial data, and taking data backups. Finally, we’ll show you how to set up and scale ElasticSearch clusters in production environments as well as providing some best practices.

Publication date:
January 2016
Publisher
Packt
Pages
240
ISBN
9781784391010

 

Chapter 1. Getting Started with Elasticsearch

Nowadays, search is one of the primary functionalities needed in every application; it can be fulfilled by Elasticsearch, which also has many other extra features. Elasticsearch, which is built on top of Apache Lucene, is an open source, distributable, and highly scalable search engine. It provides extremely fast searches and makes data discovery easy.

In this chapter, we will cover the following topics:

  • Concepts and terminologies related to Elasticsearch

  • Rest API and the JSON data structure

  • Installing and configuring Elasticsearch

  • Installing the Elasticsearch plugins

  • Basic operations with Elasticsearch

 

Introducing Elasticsearch


Elasticsearch is a distributed, full text search and analytic engine that is build on top of Lucene, a search engine library written in Java, and is also a base for Solr. After its first release in 2010, Elasticsearch has been widely adopted by large as well as small organizations, including NASA, Wikipedia, and GitHub, for different use cases. The latest releases of Elasticsearch are focusing more on resiliency, which builds confidence in users being able to use Elasticsearch as a data storeage tool, apart from using it as a full text search engine. Elasticsearch ships with sensible default configurations and settings, and also hides all the complexities from beginners, which lets everyone become productive very quickly by just learning the basics.

The primary features of Elasticsearch

Lucene is a blazing fast search library but it is tough to use directly and has very limited features to scale beyond a single machine. Elasticsearch comes to the rescue to overcome all the limitations of Lucene. Apart from providing a simple HTTP/JSON API, which enables language interoperability in comparison to Lucene's bare Java API, it has the following main features:

  • Distributed: Elasticsearch is distributed in nature from day one, and has been designed for scaling horizontally and not vertically. You can start with a single-node Elasticsearch cluster on your laptop and can scale that cluster to hundreds or thousands of nodes without worrying about the internal complexities that come with distributed computing, distributed document storage, and searches.

  • High Availability: Data replication means having multiple copies of data in your cluster. This feature enables users to create highly available clusters by keeping more than one copy of data. You just need to issue a simple command, and it automatically creates redundant copies of the data to provide higher availabilities and avoid data loss in the case of machine failure.

  • REST-based: Elasticsearch is based on REST architecture and provides API endpoints to not only perform CRUD operations over HTTP API calls, but also to enable users to perform cluster monitoring tasks using REST APIs. REST endpoints also enable users to make changes to clusters and indices settings dynamically, rather than manually pushing configuration updates to all the nodes in a cluster by editing the elasticsearch.yml file and restarting the node. This is possible because each resource (index, document, node, and so on) in Elasticsearch is accessible via a simple URI.

  • Powerful Query DSL: Query DSL (domain-specific language) is a JSON interface provided by Elasticsearch to expose the power of Lucene to write and read queries in a very easy way. Thanks to the Query DSL, developers who are not aware of Lucene query syntaxes can also start writing complex queries in Elasticsearch.

  • Schemaless: Being schemaless means that you do not have to create a schema with field names and data types before indexing the data in Elasticsearch. Though it is one of the most misunderstood concepts, this is one of the biggest advantages we have seen in many organizations, especially in e-commerce sectors where it's difficult to define the schema in advance in some cases. When you send your first document to Elasticsearch, it tries its best to parse every field in the document and creates a schema itself. Next time, if you send another document with a different data type for the same field, it will discard the document. So, Elasticsearch is not completely schemaless but its dynamic behavior of creating a schema is very useful.

    Note

    There are many more features available in Elasticsearch, such as multitenancy and percolation, which will be discussed in detail in the next chapters.

Understanding REST and JSON

Elasticsearch is based on a REST design pattern and all the operations, for example, document insertion, deletion, updating, searching, and various monitoring and management tasks, can be performed using the REST endpoints provided by Elasticsearch.

What is REST?

In a REST-based web API, data and services are exposed as resources with URLs. All the requests are routed to a resource that is represented by a path. Each resource has a resource identifier, which is called as URI. All the potential actions on this resource can be done using simple request types provided by the HTTP protocol. The following are examples that describe how CRUD operations are done with REST API:

  • To create the user, use the following:

    POST /user
    fname=Bharvi&lname=Dixit&age=28&id=123
    
  • The following command is used for retrieval:

    GET /user/123
    
  • Use the following to update the user information:

    PUT /user/123
    fname=Lavleen
    
  • To delete the user, use this:

    DELETE /user/123
    

    Note

    Many Elasticsearch users get confused between the POST and PUT request types. The difference is simple. POST is used to create a new resource, while PUT is used to update an existing resource. The PUT request is used during resource creation in some cases but it must have the complete URI available for this.

What is JSON?

All the real-world data comes in object form. Every entity (object) has some properties. These properties can be in the form of simple key value pairs or they can be in the form of complex data structures. One property can have properties nested into it, and so on.

Elasticsearch is a document-oriented data store where objects, which are called as documents, are stored and retrieved in the form of JSON. These objects are not only stored, but also the content of these documents gets indexed to make them searchable.

JavaScript Object Notation (JSON) is a lightweight data interchange format and, in the NoSQL world, it has become a standard data serialization format. The primary reason behind using it as a standard format is the language independency and complex nested data structure that it supports. JSON has the following data type support:

Array, Boolean, Null, Number, Object, and String

The following is an example of a JSON object, which is self-explanatory about how these data types are stored in key value pairs:

{
  "int_array": [1, 2,3],
  "string_array": ["Lucene" ,"Elasticsearch","NoSQL"],
  "boolean": true,
  "null": null,
  "number": 123,
  "object": {
    "a": "b",
    "c": "d",
    "e": "f"
  },
  "string": "Learning Elasticsearch"
}

Elasticsearch common terms

The following are the most common terms that are very important to know when starting with Elasticsearch:

  • Node: A single instance of Elasticsearch running on a machine.

  • Cluster: A cluster is the single name under which one or more nodes/instances of Elasticsearch are connected to each other.

  • Document: A document is a JSON object that contains the actual data in key value pairs.

  • Index: A logical namespace under which Elasticsearch stores data, and may be built with more than one Lucene index using shards and replicas.

  • Doc types: A doc type in Elasticsearch represents a class of similar documents. A type consists of a name, such as a user or a blog post, and a mapping, including data types and the Lucene configurations for each field. (An index can contain more than one type.)

  • Shard: Shards are containers that can be stored on a single node or multiple nodes and are composed of Lucene segments. An index is divided into one or more shards to make the data distributable.

    Note

    A shard can be either primary or secondary. A primary shard is the one where all the operations that change the index are directed. A secondary shard is the one that contains duplicate data of the primary shard and helps in quickly searching the data as well as for high availability; in a case where the machine that holds the primary shard goes down, then the secondary shard becomes the primary automatically.

  • Replica: A duplicate copy of the data living in a shard for high availability.

Understanding Elasticsearch structure with respect to relational databases

Elasticsearch is a search engine in the first place but, because of its rich functionality offerings, organizations have started using it as a NoSQL data store as well. However, it has not been made for maintaining the complex relationships that are offered by traditional relational databases.

If you want to understand Elasticsearch in relational database terms then, as shown in the following image, an index in Elasticsearch is similar to a database that consists of multiple types. A single row is represented as a document, and columns are similar to fields.

Elasticsearch does not have the concept of referential integrity constraints such as foreign keys. But, despite being a search engine and NoSQL data store, it does allow us to maintain some relationships among different documents, which will be discussed in the upcoming chapters.

With these theoretical concepts, we are good to go with learning the practical steps with Elasticsearch.

First of all, you need to be aware of the basic requirements to install and run Elasticsearch, which are listed as follows:

  • Java (Oracle Java 1.7u55 and above)

  • RAM: Minimum 2 GB

  • Root permission to install and configure program libraries

    Note

    Please go through the following URL to check the JVM and OS dependencies of Elasticsearch: https://www.elastic.co/subscriptions/matrix.

The most common error that comes up if you are using an incompatible Java version with Elasticsearch, is the following:

Exception in thread "main" java.lang.UnsupportedClassVersionError: org/elasticsearch/bootstrap/Elasticsearch : Unsupported major.minor version 51.0
  at java.lang.ClassLoader.defineClass1(Native Method)
  at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637)
  at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
  at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
  at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
  at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

If you see the preceding error while installing/working with Elasticsearch, it is most probably because you have an incompatible version of JAVA set as the JAVA_HOME variable or not set at all. Many users install the latest version of JAVA but forget to set the JAVA_HOME variable to the latest installation. If this variable is not set, then Elasticsearch looks into the following listed directories to find the JAVA and the first existing directory is used:

/usr/lib/jvm/jdk-7-oracle-x64,  /usr/lib/jvm/java-7-oracle,  /usr/lib/jvm/java-7-openjdk,  /usr/lib/jvm/java-7-openjdk-amd64/,  /usr/lib/jvm/java-7-openjdk-armhf,  /usr/lib/jvm/java-7-openjdk-i386/, /usr/lib/jvm/default-java
 

Installing and configuring Elasticsearch


I have used the Elasticsearch Version 2.0.0 in this book; you can choose to install other versions, if you wish to. You just need to replace the version number with 2.0.0. You need to have an administrative account to perform the installations and configurations.

Installing Elasticsearch on Ubuntu through Debian package

Let's get started with installing Elasticsearch on Ubuntu Linux. The steps will be the same for all Ubuntu versions:

  1. Download the Elasticsearch Version 2.0.0 Debian package:

    wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-2.0.0.deb
    
  2. Install Elasticsearch, as follows:

    sudo dpkg -i elasticsearch-2.0.0.deb
    
  3. To run Elasticsearch as a service (to ensure Elasticsearch starts automatically when the system is booted), do the following:

    sudo update-rc.d elasticsearch defaults 95 10
    

Installing Elasticsearch on Centos through the RPM package

Follow these steps to install Elasticsearch on Centos machines. If you are using any other Red Hat Linux distribution, you can use the same commands, as follows:

  1. Download the Elasticsearch Version 2.0.0 RPM package:

    wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-2.0.0.rpm
    
  2. Install Elasticsearch, using this command:

    sudo rpm -i elasticsearch-2.0.0.rpm
    
  3. To run Elasticsearch as a service (to ensure Elasticsearch starts automatically when the system is booted), use the following:

    sudo systemctl daemon-reload
    sudo systemctl enable elasticsearch.service
    

Understanding the Elasticsearch installation directory layout

The following table shows the directory layout of Elasticsearch that is created after installation. These directories, have some minor differences in paths depending upon the Linux distribution you are using.

Description

Path on Debian/Ubuntu

Path on RHEL/Centos

Elasticsearch home directory

/usr/share/elasticsearch

/usr/share/elasticsearch

Elasticsearch and Lucene jar files

/usr/share/elasticsearch/lib

/usr/share/elasticsearch/lib

Contains plugins

/usr/share/elasticsearch/plugins

/usr/share/elasticsearch/plugins

The locations of the binary scripts that are used to start an ES node and download plugins

usr/share/elasticsearch/bin

usr/share/elasticsearch/bin

Contains the Elasticsearch configuration files: (elasticsearch.yml and logging.yml)

/etc/elasticsearch

/etc/elasticsearch

Contains the data files of the index/shard allocated on that node

/var/lib/elasticsearch/data

/var/lib/elasticsearch/data

The startup script for Elasticsearch (contains environment variables including HEAP SIZE and file descriptors)

/etc/init.d/elasticsearch

/etc/sysconfig/elasticsearch

Or /etc/init.d/elasticsearch

Contains the log files of Elasticsearch.

/var/log/elasticsearch/

/var/log/elasticsearch/

During installation, a user and a group with the elasticsearch name are created by default. Elasticsearch does not get started automatically just after installation. It is prevented from an automatic startup to avoid a connection to an already running node with the same cluster name.

Note

It is recommended to change the cluster name before starting Elasticsearch for the first time.

Configuring basic parameters

  1. Open the elasticsearch.yml file, which contains most of the Elasticsearch configuration options:

    sudo vim /etc/elasticsearch/elasticsearch.yml
    
  2. Now, edit the following ones:

    • cluster.name: The name of your cluster

    • node.name: The name of the node

    • path.data: The path where the data for the ES will be stored

    Note

    Similar to path.data, we can change path.logs and path.plugins as well. Make sure all these parameters values are inside double quotes.

  3. After saving the elasticsearch.yml file, start Elasticsearch:

    sudo service elasticsearch start
    

    Elasticsearch will start on two ports, as follows:

    • 9200: This is used to create HTTP connections

    • 9300: This is used to create a TCP connection through a JAVA client and the node's interconnection inside a cluster

      Note

      Do not forget to uncomment the lines you have edited. Please note that if you are using a new data path instead of the default one, then you first need to change the owner and the group of that data path to the user, elasticsearch.

      The command to change the directory ownership to elasticsearch is as follows:

      sudo chown –R elasticsearch:elasticsearch data_directory_path
      
  4. Run the following command to check whether Elasticsearch has been started properly:

    sudo service elasticsearch status
    

    If the output of the preceding command is shown as elasticsearch is not running, then there must be some configuration issue. You can open the log file and see what is causing the error.

The list of possible issues that might prevent Elasticsearch from starting is:

  • A Java issue, as discussed previously

  • Indention issues in the elasticsearch.yml file

  • At least 1 GB of RAM is not free to be used by Elasticsearch

  • The ownership of the data directory path is not changed to elasticsearch

  • Something is already running on port 9200 or 9300

Adding another node to the cluster

Adding another node in a cluster is very simple. You just need to follow all the steps for installation on another system to install a new instance of Elasticsearch. However, keep the following in mind:

  • In the elasticsearch.yml file, cluster.name is set to be the same on both the nodes

  • Both the systems should be reachable from each other over the network.

  • There is no firewall rule set for Elasticsearch port blocking

  • The Elasticsearch and JAVA versions are the same on both the nodes

You can optionally set the network.host parameter to the IP address of the system to which you want Elasticsearch to be bound and the other nodes to communicate.

Installing Elasticsearch plugins

Plugins provide extra functionalities in a customized manner. They can be used to query, monitor, and manage tasks. Thanks to the wide Elasticsearch community, there are several easy-to-use plugins available. In this book, I will be discussing some of them.

The Elasticsearch plugins come in two flavors:

  • Site plugins: These are the plugins that have a site (web app) in them and do not contain any Java-related content. After installation, they are moved to the site directory and can be accessed using es_ip:port/_plugin/plugin_name.

  • Java plugins: These mainly contain .jar files and are used to extend the functionalities of Elasticsearch. For example, the Carrot2 plugin that is used for text-clustering purposes.

Elasticsearch ships with a plugin script that is located in the /user/share/elasticsearch/bin directory, and any plugin can be installed using this script in the following format:

bin/plugin --install plugin_url

Note

Once the plugin is installed, you need to restart that node to make it active. In the following image, you can see the different plugins installed inside the Elasticsearch node. Plugins need to be installed separately on each node of the cluster.

The following is the layout of the plugin directory of Elasticsearch:

Checking for installed plugins

You can check the log of your node that shows the following line at start up time:

[2015-09-06 14:16:02,606][INFO ][plugins                  ] [Matt Murdock] loaded [clustering-carrot2, marvel], sites [marvel, carrot2, head]

Alternatively, you can use the following command:

curl XGET 'localhost:9200/_nodes/plugins'?pretty

Another option is to use the following URL in your browser:

http://localhost:9200/_nodes/plugins

Installing the Head plugin for Elasticsearch

The Head plugin is a web front for the Elasticsearch cluster that is very easy to use. This plugin offers various features such as showing the graphical representations of shards, the cluster state, easy query creations, and downloading query-based data in the CSV format.

The following is the command to install the Head plugin:

sudo /usr/share/elasticsearch/bin/plugin -install mobz/elasticsearch-head

Restart the Elasticsearch node with the following command to load the plugin:

sudo service elasticsearch restart

Once Elasticsearch is restarted, open the browser and type the following URL to access it through the Head plugin:

http://localhost:9200/_plugin/head

Note

More information about the Head plugin can be found here: https://github.com/mobz/elasticsearch-head

Installing Sense for Elasticsearch

Sense is an awesome tool to query Elasticsearch. You can add it to your latest version of Chrome, Safari, or Firefox browsers as an extension.

Now, when Elasticsearch is installed and running in your system, and you have also installed the plugins, you are good to go with creating your first index and performing some basic operations.

 

Basic operations with Elasticsearch


We have already seen how Elasticsearch stores data and provides REST APIs to perform the operations. In next few sections, we will be performing some basic actions using the command line tool called CURL. Once you have grasped the basics, you will start programming and implementing these concepts using Python and Java in upcoming chapters.

Note

When we create an index, Elasticsearch by default creates five shards and one replica for each shard (this means five primary and five replica shards). This setting can be controlled in the elasticsearch.yml file by changing the index.number_of_shards properties and the index.number_of_replicas settings, or it can also be provided while creating the index.

Once the index is created, the number of shards can't be increased or decreased; however, you can increase or decrease the number of replicas at any time after index creation. So it is better to choose the number of required shards for an index at the time of index creation.

Creating an Index

Let's begin by creating our first index and give this index a name, which is book in this case. After executing the following command, an index with five shards and one replica will be created:

curl –XPUT 'localhost:9200/books/'

Note

Uppercase letters and blank spaces are not allowed in index names.

Indexing a document in Elasticsearch

Similar to all databases, Elasticsearch has the concept of having a unique identifier for each document that is known as _id. This identifier is created in two ways, either you can provide your own unique ID while indexing the data, or if you don't provide any id, Elasticsearch creates a default id for that document. The following are the examples:

curl -XPUT 'localhost:9200/books/elasticsearch/1' -d '{
"name":"Elasticsearch Essentials",
"author":"Bharvi Dixit", 
"tags":["Data Analytics","Text Search","Elasticsearch"],
"content":"Added with PUT request"
}'

On executing above command, Elasticsearch will give the following response:

{"_index":"books","_type":"elasticsearch","_id":"1","_version":1,"created":true}

However, if you do not provide an id, which is 1 in our case, then you will get the following error:

No handler found for uri [/books/elasticsearch] and method [PUT] 

The reason behind the preceding error is that we are using a PUT request to create a document. However, Elasticsearch has no idea where to store this document (no existing URI for the document is available).

If you want the _id to be auto generated, you have to use a POST request. For example:

curl -XPOST 'localhost:9200/books/elasticsearch' -d '{
"name":"Elasticsearch Essentials",
"author":"Bharvi Dixit", 
"tags":["Data Anlytics","Text Search","Elasticsearch"],
"content":"Added with POST request"
}'

The response from the preceding request will be as follows:

{"_index":"books","_type":"elasticsearch","_id":"AU-ityC8xdEEi6V7cMV5","_version":1,"created":true}

If you open the localhost:9200/_plugin/head URL, you can perform all the CRUD operations using the HEAD plugin as well:

Some of the stats that you can see in the preceding image are these:

  • Cluster name: elasticsearch_cluster

  • Node name: node-1

  • Index name: books

  • No. of primary shards: 5

  • No. of docs in the index: 2

  • No. of unassigned shards (replica shards): 5

    Note

    Cluster states in Elasticsearch

    An Elasticsearch cluster can be in one of the three states: GREEN, YELLOW, or RED. If all the shards, meaning primary as well as replicas, are assigned in the cluster, it will be in the GREEN state. If any one of the replica shards is not assigned because of any problem, then the cluster will be in the YELLOW state. If any one of the primary shards is not assigned on a node, then the cluster will be in the RED state. We will see more on these states in the upcoming chapters. Elasticsearch never assigns a primary and its replica shard on the same node.

Fetching documents

We have stored documents in Elasticsearch. Now we can fetch them using their unique ids with a simple GET request.

Get a complete document

We have already indexed our document. Now, we can get the document using its document identifier by executing the following command:

curl -XGET 'localhost:9200/books/elasticsearch/1'?pretty

The output of the preceding command is as follows:

{
  "_index" : "books",
  "_type" : "elasticsearch",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source":{"name":"Elasticsearch Essentials","author":"Bharvi Dixit", "tags":["Data Anlytics","Text Search","ELasticsearch"],"content":"Added with PUT request"}
}

Note

pretty is used in the preceding request to make the response nicer and more readable.

As you can see, there is a _source field in the response. This is a special field reserved by Elasticsearch to store all the JSON data. There are options available to not store the data in this field since it comes with an extra disk space requirement. However, this also helps in many ways while returning data from ES, re-indexing data, or doing partial document updates. We will see more on this field in the next chapters.

If the document did not exist in the index, the _found field would have been marked as false.

Getting part of a document

Sometimes you need only some of the fields to be returned instead of returning the complete document. For these scenarios, you can send the names of the fields to be returned inside the _source parameter with the GET request:

curl -XGET 'localhost:9200/books/elasticsearch/1'?_source=name,author

The response of Elasticsearch will be as follows:

{
"_index":"books",
"_type":"elasticsearch",
"_id":"1",
"_version":1,
"found":true,
"_source":{"author":"Bharvi Dixit","name":"Elasticsearch Essentials"}
}

Updating documents

It is possible to update documents in Elasticsearch, which can be done either completely or partially, but updates come with some limitations and costs. In the next sections, we will see how these operations can be performed and how things work behind the scenes.

Updating a whole document

To update a whole document, you can use a similar PUT/POST request, which we had used to create a new document:

curl -XPUT 'localhost:9200/books/elasticsearch/1' -d '{
"name":"Elasticsearch Essentials",
"author":"Bharvi Dixit", 
"tags":["Data Analytics","Text Search","Elasticsearch"],
"content":"Updated document",
"publisher":"pact-pub"
}'

The response of Elasticsearch looks like this:

{"_index":"books","_type":"elasticsearch","_id":"1","_version":2,"created":false}

If you look at the response, it shows _version is 2 and created is false, meaning the document is updated.

Updating documents partially

Instead of updating the whole document, we can use the _update API to do partial updates. As shown in the following example, we will add a new field, updated_time, to the document for which a script parameter has been used. Elasticsearch uses Groovy scripting by default.

Note

Scripting is by default disabled in Elasticsearch, so to use a script you need to enable it by adding the following parameter to your elasticsearch.yml file:

script.inline: on
curl -XPOST 'localhost:9200/books/elasticsearch/1/_update' -d '{

   "script" : "ctx._source.updated_time= \"2015-09-09T00:00:00\""

}'

The response of the preceding request will be this:

{"_index":"books","_type":"elasticsearch","_id":"1","_version":3}

It shows that a new version has been created in Elasticsearch.

Elasticsearch stores data in indexes that are composed of Lucene segments. These segments are immutable in nature, meaning that, once created, they can't be changed. So, when we send an update request to Elasticsearch, it does the following things in the background:

  • Fetches the JSON data from the _source field for that document

  • Makes changes in the _source field

  • Deletes old documents

  • Creates a new document

All these data re-indexing tasks can be done by the user; however, if you are using the UPDATE method, it is done using only one request. These processes are the same when doing a whole document update as for a partial update. The benefit of a partial update is that all operations are done within a single shard, which avoids network overhead.

Deleting documents

To delete a document using its identifier, we need to use the DELETE request:

curl -XDELETE 'localhost:9200/books/elasticsearch/1'

The following is the response of Elasticsearch:

{"found":true,"_index":"books","_type":"elasticsearch","_id":"1","_version":4}

If you are from a Lucene background, then you must know how segment merging is done and how new segments are created in the background with more documents getting indexed. Whenever we delete a document from Elasticsearch, it does not get deleted from the file system right away. Rather, Elasticsearch just marks that document as deleted, and when you index more data, segment merging is done. At the same time, the documents that are marked as deleted are indeed deleted based on a merge policy. This process is also applied while the document is updated.

The space from deleted documents can also be reclaimed with the _optimize API by executing the following command:

curl –XPOST http://localhost:9200/_optimize?only_expunge_deletes=true'

Checking documents' existence

While developing applications, some scenarios require you to check whether a document exists or not in Elasticsearch. In these scenarios, rather than querying the documents with a GET request, you have the option of using another HTTP request method called HEAD:

curl -i -XHEAD 'localhost:9200/books/elasticsearch/1'

The following is the response of the preceding command:

HTTP/1.1 200 OK
Content-Type: text/plain; charset=UTF-8
Content-Length: 0

In the preceding command, I have used the -i parameter that is used to show the header information of an HTTP response. It has been used because the HEAD request only returns headers and not any content. If the document is found, then status code will be 200, and if not, then it will be 400.

 

Summary


A lot of things have been covered in this chapter. You have got to know about the Elasticsearch architecture and its workings. Then, you have learned about the installations of Elasticsearch and its plugins. Finally, basic operations with Elasticsearch were done.

With all these, you are ready to learn about data analysis phases and mappings in the next chapter.

About the Author

  • Bharvi Dixit

    Bharvi Dixit is an IT professional with extensive experience of working on search servers, NoSQL databases, and cloud services. He holds a master's degree in computer science and is currently working with Sentieo, a USA-based financial data and equity research platform, where he leads the overall platform and architecture of the company spanning across hundreds of servers. At Sentieo, he also plays a key role in the search and data team.

    He is also the organizer of Delhi's Elasticsearch Meetup Group, where he speaks about Elasticsearch and Lucene and is continuously building the community around these technologies.

    Bharvi also works as a freelance Elasticsearch consultant and has helped more than half a dozen organizations adapt Elasticsearch to solve their complex search problems around different use cases, such as creating search solutions for big data-automated intelligence platforms in the area of counter-terrorism and risk management, as well as in other domains, such as recruitment, e-commerce, finance, social search, and log monitoring.

    He has a keen interest in creating scalable backend platforms. His other areas of interests are search engineering, data analytics, and distributed computing. Java and Python are the primary languages in which he loves to write code. He has also built a proprietary software for consultancy firms.

    In 2013, he started working on Lucene and Elasticsearch, and in 2016, he authored his first book, Elasticsearch Essentials, which was published by Packt. He has also worked as a technical reviewer for the book Learning Kibana 5.0 by Packt.

    You can connect with him on LinkedIn at https://in.linkedin.com/in/bharvidixit or can follow him on Twitter @d_bharvi.

    Browse publications by this author
Book Title
Unlock this full book FREE 10 day trial
Start Free Trial