Mastering Elasticsearch 5.x - Third Edition

By Bharvi Dixit

About this book

Elasticsearch is a modern, fast, distributed, scalable, fault-tolerant, and open source search and analytics engine. Elasticsearch leverages the capabilities of Apache Lucene, and provides a new level of control over how you can index and search even huge sets of data. With this book you will finally be able to fully utilize the power of Elasticsearch.

This book will give you a brief recap of the basics and also introduce you to the new features of Elasticsearch 5. We will guide you through the intermediate and advanced functionalities of Elasticsearch, such as querying, indexing, searching, and modifying data. We’ll also explore advanced concepts, including aggregation, index control, sharding, replication, and clustering.

We'll show you the monitoring and administration modules available in Elasticsearch, and will also cover backup and recovery. You will get an understanding of how you can scale your Elasticsearch cluster to suit your context and improve its performance. We'll also show you how you can create your own analysis plugin in Elasticsearch.

By the end of the book, you will have all the knowledge necessary to master Elasticsearch and put it to efficient use.

Publication date:
February 2017


Chapter 1.  Revisiting Elasticsearch and the Changes

Welcome to Mastering Elasticsearch 5.x, Third Edition. Elasticsearch has progressed rapidly from version 1.x, released in 2014, to version 5.x, released in 2016. During the two-and-a-half-year period since 1.0.0, adoption has skyrocketed, and both vendors and the community have contributed bug fixes, interoperability enhancements, and rich feature upgrades to ensure Elasticsearch remains the most popular NoSQL storage, indexing, and search utility for both structured and unstructured documents, while also gaining popularity as a log analysis tool as part of the Elastic Stack.

We treat Mastering Elasticsearch as a book that will systematize your knowledge about Elasticsearch, and extend it by showing some examples of how to leverage your knowledge in certain situations. If you are looking for a book that will help you start your journey into the world of Elasticsearch, please take a look at Elasticsearch Essentials, also published by Packt.

Before going further into the book, we assume that you already know the basic concepts of Elasticsearch for performing operations such as how to index documents, how to send queries to get the documents you are interested in, how to narrow down the results of your queries by using filters, and how to calculate statistics for your data with the use of the aggregation mechanism. However, before getting to the exciting functionality that Elasticsearch offers, we think we should start with a quick overview of Apache Lucene, which is a full text search library that Elasticsearch uses to build and search its indices. We also need to make sure that we understand Lucene correctly, as Mastering Elasticsearch requires this understanding. By the end of this chapter, we will have covered the following topics:

  • An overview of Lucene and Elasticsearch

  • Introducing Elasticsearch 5.x

  • Latest features introduced in Elasticsearch

  • The changes in Elasticsearch after 1.x


An overview of Lucene

In order to fully understand how Elasticsearch works, especially when it comes to indexing and query processing, it is crucial to understand how the Apache Lucene library works. Under the hood, Elasticsearch uses Lucene to handle document indexing. The same library is also used to perform a search against the indexed documents. In the next few pages, we will try to show you the basics of Apache Lucene, just in case you've never used it.

Lucene is a mature, open source, high-performance, scalable, lightweight, and yet very powerful library written in Java. Its core is distributed as a single JAR file with no dependencies, and allows you to index documents and search them with its out-of-the-box full text search capabilities. Of course, there are extensions to Apache Lucene that allow different language handling, and enable spellchecking, highlighting, and much more; but if you don't need those features, you can download a single file and use it in your application.

Getting deeper into the Lucene index

In order to fully understand Lucene, the following terms need to be understood first:

  • Document: This is the main data carrier used during indexing and searching. It contains one or more fields, which hold the data we put into and get from Lucene.

  • Field: This is a section of the document which is built of two parts: the name and the value.

  • Term: This is a unit of search representing a word from the text.

  • Token: This is an occurrence of a term from the text of the field. It consists of term text, start and end offset, and a type.

Inverted index

Apache Lucene writes all the information to the structure called the inverted index. It is a data structure that maps the terms in the index to the documents, not the other way round, as the relational database does. You can think of an inverted index as a data structure, where data is term oriented rather than document oriented.

Let's see how a simple inverted index can look. For example, let's assume that we have the documents with only the title field to be indexed, and they look like the following:

  • Elasticsearch Server (document 1)

  • Mastering Elasticsearch (document 2)

  • Elasticsearch Essentials (document 3)

So, the index (in a very simple way) could be visualized as shown in the following table:

    Term <count>         Document : Position
    -------------------  -------------------
    Elasticsearch <3>    1:1, 2:2, 3:1
    Essentials <1>       3:2
    Mastering <1>        2:1
    Server <1>           1:2

As you can see, each term points to the documents it is present in, along with its position within each document. This allows for very efficient and fast searches, such as term-based queries. In addition to this, each term has a number connected to it: the count, telling Lucene how often the term occurs.
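The inverted index just described can be reproduced in a few lines. The following sketch assumes simple whitespace tokenization and no filtering; it is illustrative helper code, not Lucene's actual implementation:

```python
from collections import defaultdict

# The three example documents, keyed by document id.
docs = {
    1: "Elasticsearch Server",
    2: "Mastering Elasticsearch",
    3: "Elasticsearch Essentials",
}

# term -> list of (document id, position) pairs
index = defaultdict(list)
for doc_id, title in docs.items():
    for position, term in enumerate(title.split(), start=1):
        index[term].append((doc_id, position))

print(index["Elasticsearch"])       # [(1, 1), (2, 2), (3, 1)]
print(len(index["Elasticsearch"]))  # 3 -- the term's document count
```

Looking up a term is then a single dictionary access, which is exactly why term-oriented storage makes term-based queries fast.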


Each index is divided into multiple write-once, read-many segments. When indexing, once a segment has been written to disk, it can't be updated; for example, the information about deleted documents is stored in a separate file, while the segment itself stays untouched.

However, multiple segments can be combined in a process called segment merging. Segments are merged either when a merge is forced or when Lucene decides it is time to perform one, producing larger segments from smaller ones. This can be I/O demanding; however, it is needed to clean up information that is no longer required, such as deleted documents. In addition to this, searching against one larger segment is faster than searching against multiple smaller ones holding the same data.
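The merge process can be sketched as combining the postings of read-only segments while permanently dropping deleted documents. The dictionaries below are an illustrative in-memory model; real segments are on-disk Lucene structures:

```python
# Two write-once segments, each mapping a term to the ids of documents
# containing it, plus the separately tracked set of deleted documents.
seg_a = {"lucene": [1, 2], "search": [2]}
seg_b = {"lucene": [3], "elasticsearch": [3, 4]}
deleted = {2}

# Merging: concatenate postings segment by segment, skipping deleted docs,
# so the deleted information is finally cleaned up.
merged = {}
for segment in (seg_a, seg_b):
    for term, doc_ids in segment.items():
        live = [d for d in doc_ids if d not in deleted]
        if live:
            merged.setdefault(term, []).extend(live)

print(merged)  # {'lucene': [1, 3], 'elasticsearch': [3, 4]}
```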

Of course, the actual index created by Lucene is much more complicated and advanced, and consists of more than the terms, their counts, and documents, in which they are present. We would like to tell you about a few of these additional index pieces because even though they are internal, it is usually good to know about them, as they can be very useful.


Norms

A norm is a factor associated with each indexed document; it stores the normalization factors used to compute the score relative to the query. Norms are computed on the basis of index-time boosts and are indexed along with the documents. With the use of norms, Lucene is able to provide index-time boosting functionality at the cost of a certain amount of additional space needed for norms indexation and some additional memory.

Term vectors

Term vectors are small inverted indices per document. They consist of pairs (a term and its frequency) and can optionally include information about term positions. By default, Lucene and Elasticsearch don't enable term vector indexing, but some functionalities, such as the fast vector highlighter, require them to be present.

Posting formats

With the release of Lucene 4.0, the library introduced the so-called codec architecture, giving developers control over how the index files are written onto the disk. One of the parts of the index is the posting format, which stores fields, terms, documents, term positions and offsets, and, finally, the payloads (a byte array stored at an arbitrary position in the Lucene index, which can contain any information we want). Lucene contains different posting formats for different purposes; for example, one that is optimized for high-cardinality fields such as unique identifiers.

Doc values

As we have already mentioned, the Lucene index is a so-called inverted index. However, for certain features, such as aggregations, such an architecture is not the best one. The mentioned functionality operates on the document level and not the term level, because Elasticsearch needs to uninvert the index before calculations can be done. Because of that, doc values were introduced as an additional structure used for sorting and aggregations. Doc values store uninverted data for the fields they are turned on for. Both Lucene and Elasticsearch allow us to configure the implementation used to store them, giving us the possibility of memory-based doc values, disk-based doc values, and a combination of the two. Doc values have been enabled by default in Elasticsearch since the 2.x release.

Document analysis

When we index a document into Elasticsearch, it goes through an analysis phase, which is necessary in order to create the inverted indices. It is a series of steps performed by Lucene: character filtering, tokenization, and token filtering.

Analysis is done by the analyzer, which is built of a tokenizer and zero or more filters, and can also have zero or more character filters.

A tokenizer in Lucene is used to divide the text into tokens, which are basically terms with additional information, such as position in the original text and length. The result of the tokenizer's work is a so-called token stream, where the tokens are put one by one, ready to be processed by filters.

Apart from the tokenizer, the Lucene analyzer is built of zero or more filters that are used to process tokens in the token stream. For example, it can remove tokens from the stream, change them, or even produce new ones. There are numerous filters and you can easily create new ones. Some examples of filters are as follows:

  • Lowercase filter: This makes all the tokens lowercase

  • ASCII folding filter: This removes non-ASCII parts from tokens

  • Synonyms filter: This is responsible for changing one token to another on the basis of synonym rules

  • Multiple language stemming filters: These are responsible for reducing tokens (actually the text part that they provide) into their root or base forms, the stem

Filters are processed one after another, so we have almost unlimited analysis possibilities by chaining multiple filters together.

The last piece is character filtering, which is applied before the tokenizer and is responsible for processing the text before any analysis is done. One example of a character filter is the HTML tag removal process.
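The whole chain (character filters, then the tokenizer, then token filters) can be sketched as follows; the helpers are hypothetical stand-ins for Lucene components, not its actual API:

```python
import re

def html_strip_char_filter(text):
    # Character filter: runs before tokenization and removes HTML tags.
    return re.sub(r"<[^>]+>", " ", text)

def whitespace_tokenizer(text):
    # Tokenizer: turns the text into a stream of tokens.
    return text.split()

def lowercase_filter(tokens):
    # Token filter: processes tokens one by one in the stream.
    return [token.lower() for token in tokens]

def analyze(text):
    return lowercase_filter(whitespace_tokenizer(html_strip_char_filter(text)))

print(analyze("<b>Mastering</b> Elasticsearch"))  # ['mastering', 'elasticsearch']
```

Swapping in a different tokenizer or adding more filters to the chain changes the produced terms, which is exactly how real analyzers are composed.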

The analysis phase is also applied at query time. However, you can also choose the other path and not analyze your queries. This is crucial to remember, because some Elasticsearch queries are analyzed and some are not. For example, the prefix query is not analyzed, while the match query is.

What you should remember about indexing and querying analysis is that the indexed terms must match the query terms. If they don't match, Lucene won't return the desired documents. For example, if you are using stemming and lowercasing during indexing, you need to make sure that the terms in the query are also lowercased and stemmed, or your queries will return no results at all.

Basics of the Lucene query language

Some of the query types provided by Elasticsearch support Apache Lucene query parser syntax. Because of this, it is crucial to understand the Lucene query language.

A query is divided by Apache Lucene into terms and operators. A term, in Lucene, can be a single word or a phrase (a group of words surrounded by double quote characters). If the query is set to be analyzed, the defined analyzer will be used on each of the terms that form the query.

A query can also contain Boolean operators that connect terms to each other forming clauses. The list of Boolean operators is as follows:

  • AND: This means that the given two terms (left and right operand) need to match in order for the clause to be matched. For example, we would run a query, such as apache AND lucene, to match documents with both apache and lucene terms in a document field.

  • OR: This means that any of the given terms may match in order for the clause to be matched. For example, we would run a query, such as apache OR lucene, to match documents with apache or lucene (or both) terms in a document field.

  • NOT: This means that in order for the document to be considered a match, the term appearing after the NOT operator must not match. For example, we would run a query lucene NOT Elasticsearch to match documents that contain the lucene term, but not the Elasticsearch term in the document field.

In addition to these, we may use the following operators:

  • +: This means that the given term needs to be matched in order for the document to be considered as a match. For example, in order to find documents that match the lucene term and may match the apache term, we would run a query such as +lucene apache.

  • -: This means that the given term can't be matched in order for the document to be considered a match. For example, in order to find a document with the lucene term, but not the Elasticsearch term, we would run a query such as +lucene -Elasticsearch.

When no operator is specified, the default OR operator is used.

In addition to all these, there is one more thing: you can use parentheses to group clauses together; for example, with something like the following query:

 Elasticsearch AND (mastering OR book) 
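As a sketch of how such a clause selects documents, the query above can be evaluated with set membership over analyzed terms. This is a toy model using the three example titles; Lucene's real query execution is far more involved:

```python
docs = {
    1: "elasticsearch server",
    2: "mastering elasticsearch",
    3: "elasticsearch essentials",
}
# Each document's field reduced to its set of analyzed (lowercased) terms.
terms = {doc_id: set(text.split()) for doc_id, text in docs.items()}

def matches(doc_terms):
    # Elasticsearch AND (mastering OR book)
    return "elasticsearch" in doc_terms and (
        "mastering" in doc_terms or "book" in doc_terms
    )

print([doc_id for doc_id, t in terms.items() if matches(t)])  # [2]
```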

Querying fields

Of course, just like in Elasticsearch, in Lucene all your data is stored in fields that build the document. In order to run a query against a field, you need to provide the field name, add the colon character, and provide the clause that should be run against that field. For example, if you would like to match documents with the term Elasticsearch in the title field, you would run the following query:

 title:Elasticsearch 
You can also group multiple clauses. For example, if you would like your query to match all the documents having the Elasticsearch term and the mastering book phrase in the title field, you could run a query like the following code:

 title:(+Elasticsearch +"mastering book") 

The previous query can also be expressed in the following way:

+title:Elasticsearch +title:"mastering book" 

Term modifiers

In addition to the standard field query with a simple term or clause, Lucene allows us to modify the terms we pass in the query with modifiers. The most common modifiers, which you will be familiar with, are wildcards. There are two wildcards supported by Lucene: ? and *. The first matches any single character and the second matches zero or more characters.

In addition to this, Lucene supports fuzzy and proximity searches with the use of the ~ character followed by an integer. When used on a single word term, it means that we want to search for terms that are similar to the one we've modified (the so-called fuzzy search). The integer after the ~ character specifies the maximum number of edits that can be done to consider the term similar. For example, if we run a query such as writer~2, both the terms writer and writers would be considered a match.
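The "maximum number of edits" is the classic edit distance. The following is a textbook Levenshtein implementation for illustration only; Lucene actually uses Damerau-Levenshtein automata, which also count transpositions as single edits:

```python
def edit_distance(a, b):
    # Dynamic-programming Levenshtein distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# writer~2 matches any term within two edits of "writer":
print(edit_distance("writer", "writers"))  # 1, so "writers" matches
```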

When the ~ character is used on a phrase, the integer number we provide is telling Lucene how much distance between the words is acceptable. For example, let's take the following query:

title:"mastering Elasticsearch" 

It would match the document with the title field containing mastering Elasticsearch, but not mastering book Elasticsearch. However, if we ran a query, such as title:"mastering Elasticsearch"~2, it would result in both example documents being matched.

We can also use boosting to increase a term's importance by using the ^ character and providing a float number. Boosts lower than 1 decrease the document importance, while boosts higher than 1 increase it. The default boost value is 1. Please refer to The changed default text scoring in Lucene - BM25 section in Chapter 2, The Improved Query DSL, for further information on what boosting is and how it is taken into consideration during document scoring.

In addition to all these, we can use square and curly brackets to allow range searching. For example, if we would like to run a range search on a numeric field, we could run the following query:

price:[10.00 TO 15.00] 

The preceding query would return all documents with the price field between 10.00 and 15.00, inclusive.

In the case of string-based fields, we can also run a range query; for example, name:[Adam TO Adria].

The preceding query would return all documents containing terms between Adam and Adria in the name field, including those two.

If you would like your range bound or bounds to be exclusive, use curly brackets instead of the square ones. For example, in order to find documents with the price field between 10.00 inclusive and 15.00 exclusive, we would run the following query:

price:[10.00 TO 15.00} 

If you would like your range to be bounded on one side only, for example when querying for documents with a price of 10.00 or higher, we would run the following query:

price:[10.00 TO *] 

Handling special characters

In case you want to search for one of the special characters (which are +, -, &&, ||, !, (, ), { }, [ ], ^, ", ~, *, ?, :, \, /), you need to escape it with the use of the backslash (\) character. For example, to search for the abc"efg term you need to write something like abc\"efg.
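A small helper that applies this escaping rule might look like the following sketch (a hypothetical utility; in Java, Lucene ships its own QueryParser.escape method):

```python
# Lucene query-parser special characters; && and || are covered by
# escaping the individual & and | characters.
LUCENE_SPECIAL = set('+-&|!(){}[]^"~*?:\\/')

def escape_lucene(term):
    # Prefix every special character with a backslash.
    return "".join("\\" + ch if ch in LUCENE_SPECIAL else ch
                   for ch in term)

print(escape_lucene('abc"efg'))  # abc\"efg
```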

An overview of Elasticsearch

Although we've said that we expect the reader to be familiar with Elasticsearch, we would really like to give you a short introduction to the concepts of this great search engine.

As you probably know, Elasticsearch is a distributed full text search and analytics engine that is built on top of Lucene for building search and analysis-oriented applications. It was originally started by Shay Banon and first published in February 2010. Since then, it has rapidly gained popularity and has become an important alternative to other open source and commercial solutions. It is one of the most downloaded open source projects.

The key concepts

There are a few concepts that come with Elasticsearch, and their understanding is crucial to fully understand how Elasticsearch works and operates:

  • Index: A logical namespace under which Elasticsearch stores data and may be built with more than one Lucene index using shards and replicas.

  • Document: A document is a JSON object that contains the actual data in key-value pairs. It is very important to understand that when a field is indexed for the first time, Elasticsearch assigns a data type to that field. Starting from version 2.x, very strict type checking is enforced.

  • Type: A doc type in Elasticsearch represents a class of similar documents. A type consists of a name, such as user or blog post, and a mapping, including data types and the Lucene configuration for each field.

  • Mapping: As already mentioned in the An overview of Lucene section, all documents are analyzed before being indexed. We can configure how the input text is divided into tokens, which tokens should be filtered out, or what additional processing, such as removing HTML tags, is needed. This is where mapping comes into play: it holds all the information about the analysis chain. Besides the fact that Elasticsearch can automatically discover a field type by looking at its value, in most cases we will want to configure the mappings ourselves to avoid unpleasant surprises.

  • Node: A single instance of Elasticsearch running on a machine. Elasticsearch nodes can serve different purposes. Of course, Elasticsearch is designed to index and search our data, so the first type of node is the data node. Such nodes hold the data and perform searches on it. The second type of node is the master node, which works as a supervisor of the cluster, controlling other nodes' work. The third node type is the client node, which is used as a query router. The fourth type of node is the tribe node, which was introduced in Elasticsearch 1.0. The tribe node can join multiple clusters and thus act as a bridge between them, allowing us to execute almost all Elasticsearch functionalities on multiple clusters just as we would on a single cluster. Elasticsearch 5.0 has also introduced a new type of node called the ingest node, which can be used for data transformation before the data gets indexed.

  • Cluster: A cluster is a single name under which one or more nodes/instances of Elasticsearch are connected to each other.

  • Shard: Shards are containers that can be stored on a single node or multiple nodes and are composed of Lucene segments. An index is divided into one or more shards to make the data distributable. Once an index is created, the number of its shards cannot be increased or decreased.


    A shard can be either primary or secondary. A primary shard is the one where all the operations that change the index are directed. A secondary shard is the one that contains duplicate data of the primary shard and helps in quickly searching data as well as in high availability; in case the machine that holds the primary shard goes down, then the secondary shard becomes the primary shard automatically.

  • Replica: A duplicate copy of the data living in a shard for high availability. Having a replica also provides a faster search experience.
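Because shard and replica counts are fixed properties of an index, they are set at creation time in the index settings. A minimal sketch of creating an index with three primary shards and one replica per shard might look like the following request (the index name is a placeholder):

```
PUT /books
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}
```

The replica count, unlike the shard count, can still be changed later through the index settings API.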

Working of Elasticsearch

Elasticsearch uses the Zen discovery module for cluster formation. In 1.x, multicast was the default discovery type, but in 2.x unicast became the default, with multicast still available as a plugin. Multicast support has been completely removed from Elasticsearch 5.0.

When an Elasticsearch node starts, it performs discovery and searches for the list of unicast hosts (master-eligible nodes), which are configured in the elasticsearch.yml configuration file using the discovery.zen.ping.unicast.hosts parameter. By default, the list of unicast hosts is ["127.0.0.1", "[::1]"], so out of the box a node will only try to form a cluster with other nodes running on the same machine. We will have a detailed section on Zen discovery and node configurations in Chapter 8, Elasticsearch Administration.
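A minimal unicast discovery setup in elasticsearch.yml might look like the following sketch (the cluster name and host names are placeholders for your own master-eligible nodes):

```
cluster.name: my-cluster
discovery.zen.ping.unicast.hosts: ["master-node-1", "master-node-2", "master-node-3"]
discovery.zen.minimum_master_nodes: 2
```

Setting discovery.zen.minimum_master_nodes to a majority of master-eligible nodes is the standard guard against split-brain situations.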


Introducing Elasticsearch 5.x

In 2015, Elasticsearch, after acquiring Kibana, Logstash, Beats, and Found, re-branded the company name as Elastic. According to Shay Banon, the name change is part of an initiative to better align the company with the broad solutions it provides: future products, and new innovations created by Elastic's massive community of developers and enterprises that utilize the ELK stack for everything from real-time search, to sophisticated analytics, to building modern data applications.

But having several products under one roof resulted in discord among their release processes and started creating confusion for users. As a result, the ELK Stack was renamed the Elastic Stack, and the company decided to release all of its components together, sharing a single version number, to keep pace with your deployments, simplify compatibility testing, and make it even easier for developers to add new functionality across the stack.

The very first GA release under the Elastic Stack umbrella is 5.0.0, which will be covered throughout this book. Further, Elasticsearch keeps pace with Lucene releases to incorporate bug fixes and the latest features. Elasticsearch 5.0 is based on Lucene 6, a major Lucene release with some awesome new features and a focus on improving search speed. We will discuss Lucene 6 in upcoming chapters to show how Elasticsearch benefits, both from search and storage points of view.

Introducing new features in Elasticsearch

Elasticsearch 5.x has many improvements and has gone through a great refactoring, which caused the removal or deprecation of some features. We will keep discussing the removed, improved, and new features in upcoming chapters, but for now let's take an overview of what is new and improved in Elasticsearch.

New features in Elasticsearch 5.x

Following are some of the most important features introduced in Elasticsearch version 5.0:

  • Ingest node: This is a new type of node in Elasticsearch, which can be used for simple data transformation and enrichment before the actual indexing takes place. The best thing is that any node can be configured to act as an ingest node, and it is very lightweight. You can avoid Logstash for these tasks because the ingest node is a Java-based implementation of Logstash-style filters and comes as a default in Elasticsearch itself.

  • Index shrinking: By design, once an index is created, there is no provision for reducing the number of shards for that index, which brings a lot of challenges, since each shard consumes resources. Although this design remains the same, to make life easier for users, Elasticsearch has introduced the new _shrink API to work around the problem. This API allows you to shrink an existing index into a new index with fewer shards.


    We will cover the ingest node and shrink API in detail under Chapter 9, Data Transformation and Federated Search.

  • Painless scripting language: In Elasticsearch, scripting has always been a matter of concern because of its slowness and for security reasons. Elasticsearch 5.0 includes a new scripting language called Painless, which has been designed to be fast and secure. Painless is still going through lots of improvements to make it more awesome and easily adaptable. We will cover it under Chapter 3, Beyond Full Text Search.

  • Instant aggregations: Queries have been completely refactored in 5.0; they are now parsed on the coordinating node and serialized to the other nodes in a binary format. This allows Elasticsearch to be much more efficient, with more cacheable queries, especially on data separated into time-based indices. This results in a significant speed-up for aggregations.

  • A new completion suggester: The completion suggester has undergone a complete rewrite. This means that the syntax and data structure for fields of type completion have changed, as have the syntax and response of the completion suggester requests. The completion suggester is now built on top of the first iteration of Lucene's new suggest API.

  • Multi-dimensional points: This is one of the most exciting features of Lucene 6, which empowers Elasticsearch 5.0. Built on the k-d tree data structure, it offers fast single- and multi-dimensional numeric range queries and geospatial point-in-shape filtering. Multi-dimensional points help in reducing disk storage and memory utilization, and enable faster searches.

  • Delete by Query API: After much demand from the community, Elasticsearch has finally provided the ability to delete documents based on a matching query using the _delete_by_query REST endpoint.
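As a quick illustration of the last two features, the following request sketches show a shrink and a delete-by-query (index names, the node name, and the query fields are placeholders, not taken from the original text):

```
# Prepare the source index for shrinking: relocate a copy of every
# shard to a single node and block writes:
PUT /source_index/_settings
{
  "index.routing.allocation.require._name": "shrink_node_name",
  "index.blocks.write": true
}

# Shrink into a new index with a single primary shard:
POST /source_index/_shrink/target_index
{
  "settings": { "index.number_of_shards": 1 }
}

# Delete all documents matching a query:
POST /my_index/_delete_by_query
{
  "query": { "match": { "status": "obsolete" } }
}
```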

New features in Elasticsearch 2.x

Apart from the features discussed just now, you can also benefit from all of the new features that came in Elasticsearch version 2.x. For those who have not had a look at the 2.x series, let's have a quick revamp of the new features which came with Elasticsearch under this series:

  • Reindex API: In Elasticsearch, re-indexing of documents is needed by almost every user in several scenarios. The _reindex API makes this task very easy, so you do not need to write your own code to do it. At the simplest level, this API provides the ability to move data from one index to another, but it also provides great control while re-indexing documents, such as using scripts for data transformation, along with many other parameters. You can take a look at the reindex API in the official Elasticsearch documentation.

  • Update by query: Similar to re-indexing, users also need to easily update documents in place, based on certain conditions, without re-indexing the data. Elasticsearch provides this feature through the _update_by_query REST endpoint, introduced in version 2.x.

  • Tasks API: The task management API, which is exposed by the _task REST endpoint, is used for retrieving information about the currently executing tasks on one or more nodes in the cluster. The following examples show the usage of the tasks API:

GET /_tasks 
GET /_tasks?nodes=nodeId1,nodeId2 
GET /_tasks?nodes=nodeId1&actions=cluster:* 
  • Since each task has an ID, you can either wait for the completion of the task or cancel the task in the following way:

POST /_tasks/taskId1/_cancel
  • Query profiler: The profile API is an awesome tool for debugging queries and getting insights into why a certain query is slow, so you can take steps to improve it. This API was released in version 2.2.0 and provides detailed timing information about the execution of individual components in a search request. You just need to send profile as true with your query object to get it working. For example:

  curl -XGET 'localhost:9200/_search' -d '{
    "profile": true,
    "query" : {
      "match" : { "message" : "query profiling test" }
    }
  }'
The changes in Elasticsearch

The change list is very long, and covering every detail is beyond the scope of this book, since most of the changes are internal and need not concern users. However, we will cover the most important changes that an existing Elasticsearch user must know about.

Although this book is based on Elasticsearch version 5.0, it is very important for the reader to know about the changes made between versions 1.x and 2.x. If you are new to Elasticsearch and are not familiar with the older versions, you can skip this section.

Changes between 1.x and 2.x

Elasticsearch version 2.x was focused on resiliency, reliability, simplification, and features. This release was based on Apache Lucene 5.x and specifically improves query execution and spatial search.

Version 2.x also delivered considerable improvements in index recovery. Historically, Elasticsearch index recovery was extremely painful, whether as part of node maintenance or an upgrade. The bigger the cluster, the bigger the headache. Node failures or a reboot could trigger a shard reallocation storm, and entire shards were sometimes copied over the network despite the data already being present on the node. Users have reported recovery times of more than a day just to restart a single node.

With 2.x, recovery of existing replica shards became almost instant, and reallocation is more lenient, which avoids reshuffling and makes rolling upgrades much easier and faster. Auto-regulating feedback loops in recent updates also eliminate past worries about merge throttling and related settings.

Elasticsearch 2.x also solved many of the known issues that plagued previous versions, including:

  • Mapping conflicts (often yielding wrong results)

  • Memory pressures and frequent garbage collections

  • Low reliability of data

  • Security breaches and split brains

  • Slow recovery during node maintenance or rolling cluster upgrades

Mapping changes

Elasticsearch developers originally treated an index like a database and a type like a table. This allowed users to create multiple types inside the same index, but it eventually became a major source of issues because of restrictions imposed by Lucene.

Fields that have the same name inside multiple types of a single index are mapped to a single field inside Lucene. Incorrect query results and even index corruption could occur when, for example, a field was an integer in one document type but a string in another. These and several other issues led to a mapping refactoring and major restrictions on handling mapping conflicts.

The following are the most significant changes imposed by Elasticsearch version 2.x:

  • Field names must be referenced by full name.

  • Field names cannot be referenced using a type name prefix.

  • Field names can't contain dots.

  • Type names can't start with a dot (.percolator is an exception).

  • Type names may not be longer than 255 characters.

  • Types may no longer be deleted. So, if an index contains multiple types, you cannot delete any of the types from the index. The only solution is to create a new index and reindex the data.

  • index_analyzer and _analyzer parameters were removed from mapping definitions.

  • Doc values became default.

  • A parent type can't pre-exist; it must be created together with its child type.

  • The ignore_conflicts option of the put mappings API was removed; conflicts can no longer be ignored.

  • Documents and mappings can't contain metadata fields that start with an underscore. So, if you have an existing document that contains a field with _id or _type, it will not work in version 2.x. You need to reindex your documents after dropping those fields.

  • The default date format has changed from date_optional_time to strict_date_optional_time, which expects a four-digit year and a two-digit month and day (and, optionally, a two-digit hour, minute, and second). So a dynamically mapped date such as "2016-01-01" will be stored inside Elasticsearch in the "strict_date_optional_time||epoch_millis" format. Please note that if you have been using Elasticsearch 1.x, your date range queries might be impacted by this. For example, if in Elasticsearch 1.x you have two documents indexed, one with the date 2017-02-28T12:00:00.000Z and the other with the date 2017-03-01T11:59:59.000Z, and you search for documents between February 28, 2017 and March 1, 2017, the following query could return both documents:

    "range": { 
      "created_at": { 
        "gte": "2017-02-28", 
        "lte": "2017-03-01" 
      } 
    } 

But from version 2.0 onwards, the same query must use the complete date and time to return the same results. For example:

    "range": { 
      "created_at": { 
        "gte": "2017-02-28T00:00:00.000Z", 
        "lte": "2017-03-01T11:59:59.000Z" 
      } 
    } 

In addition, you can also use date math in combination with date rounding to get the same results, as in the following query:

    "range": { 
      "doc.created_at": { 
        "lte": "2017-02-28||+1d/d", 
        "gte": "2017-02-28", 
        "format": "strict_date_optional_time" 
      } 
    } 

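To see why the rounded expression matches both documents, here is a minimal Python sketch of day-level date math. The function name and rounding rules are assumptions made for illustration, not the actual Elasticsearch parser: a gte bound resolves down to the start of the day, while an lte bound resolves up to the last millisecond of the day.

```python
from datetime import datetime, timedelta

def resolve_day_math(base, add_days=0, bound="gte"):
    """Approximate day-level date math such as '2017-02-28||+1d/d'.

    Illustrative sketch only: for a 'gte' bound, day rounding resolves to
    the start of the day; for an 'lte' bound, to its last millisecond.
    """
    day = datetime.strptime(base, "%Y-%m-%d") + timedelta(days=add_days)
    if bound == "lte":
        return day.replace(hour=23, minute=59, second=59, microsecond=999000)
    return day

# 'lte': '2017-02-28||+1d/d' adds one day, then rounds up to end of March 1
upper = resolve_day_math("2017-02-28", add_days=1, bound="lte")
lower = resolve_day_math("2017-02-28", bound="gte")

# Both example documents fall inside the resolved interval
assert lower <= datetime(2017, 2, 28, 12, 0, 0) <= upper
assert lower <= datetime(2017, 3, 1, 11, 59, 59) <= upper
```

This is why the date-math form returns the same two documents as the explicit timestamps shown earlier.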
Query and filter changes

Prior to version 2.0.0, Elasticsearch had two different objects for querying data: queries and filters. Each was different in functionality and performance.

Queries were used to determine how relevant a document was to a particular search request by calculating a score for each document. Filters were used to match certain criteria and were cacheable to enable faster execution. This means that if a filter matched 1,000 documents, Elasticsearch would cache those matches in memory as bitsets to retrieve them quickly in case the same filter was executed again.

However, with the release of Lucene 5.0, which is used by Elasticsearch version 2.0.0, both queries and filters became the same internal object, taking care of both document relevance and matching.

So, an Elasticsearch query that used to look like the following:

"filtered" : { 
  "query": { query definition }, 
  "filter": { filter definition } 
} 

should now be written like this in version 2.x:

"bool" : { 
  "must": { query definition }, 
  "filter": { filter definition } 
} 

Additionally, the confusion caused by choosing between a bool filter and an and / or filter has been addressed by eliminating the and / or filters in favor of the bool query syntax shown in the preceding example. And instead of the unnecessary caching and memory consumption that a badly chosen filter often caused, Elasticsearch now tracks frequently used filters and does not cache them for segments with fewer than 10,000 documents or 3% of the index.
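The rewrite from filtered to bool can be expressed as a small Python helper. This is an illustrative sketch written for this book, not part of any official Elasticsearch client: the scoring part moves to must, the non-scoring part to filter.

```python
def filtered_to_bool(query_body):
    """Rewrite a pre-2.x `filtered` query body into the 2.x+ `bool` form.

    Illustrative helper: `query` becomes `must` (scored), and `filter`
    stays a non-scoring `filter` clause inside `bool`.
    """
    filtered = query_body["filtered"]
    bool_body = {}
    if "query" in filtered:
        bool_body["must"] = filtered["query"]
    if "filter" in filtered:
        bool_body["filter"] = filtered["filter"]
    return {"bool": bool_body}

# A legacy 1.x-style query body
legacy = {
    "filtered": {
        "query": {"match": {"message": "elasticsearch"}},
        "filter": {"term": {"status": "published"}},
    }
}
migrated = filtered_to_bool(legacy)
```

Such a helper is handy when migrating stored query templates in bulk from a 1.x application.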

Security, reliability, and networking changes

Starting from 2.x, Elasticsearch runs with the Java Security Manager enabled by default, which restricts the permissions available to the process after startup.

Elasticsearch applied a durable-by-default approach to reliability and data replication across multiple nodes. Documents are now fsynced to disk before indexing requests are acknowledged, and all file renames are atomic to prevent partially written files.

On the networking side, based on extensive feedback from system administrators, Elasticsearch removed multicast discovery, and the default zen discovery mechanism is now unicast. Elasticsearch also now binds to localhost by default, preventing unconfigured nodes from joining public networks.

Monitoring parameter changes

Before version 2.0.0, Elasticsearch used the SIGAR library for operating-system-dependent statistics. But SIGAR is no longer maintained, and Elasticsearch replaced it with the statistics provided by the JVM. Accordingly, there are various changes in the monitoring parameters of the node info and node stats APIs:

  • network.* has been removed from nodes info and nodes stats.

  • fs.*.dev and fs.*.disk* have been removed from nodes stats.

  • os.* has been removed from nodes stats, except for os.timestamp, os.load_average, os.mem.*, and os.swap.*.

  • Several other fields have also been removed from nodes info.

  • From the _stats API, the id_cache parameter, which reported the memory usage of the parent-child data structure, has also been removed. This memory usage can now be fetched from fielddata.

Changes between 2.x to 5.x

Elasticsearch 2.x did not see as many releases as the 1.x series. The last release under 2.x was 2.3.4, after which Elasticsearch 5.0 was released. The following are the most important changes an existing Elasticsearch user must know before adopting the latest releases.


Elasticsearch 5.x requires Java 8, so make sure to upgrade your Java version before getting started with Elasticsearch.

Mapping changes

From a user's perspective, the mapping changes are the most important ones to know, because a wrong mapping will prevent index creation or can lead to unwanted search behavior. Here are the most important changes in this category that you need to know.

No more string fields

The string type has been removed in favor of the text and keyword data types. In earlier versions of Elasticsearch, the default mapping for string-based fields looked like the following:

         "content" : { 
            "type" : "string" 
         } 

Starting from version 5.0, the same field will be created using the following syntax:

          "content" : { 
            "type" : "text", 
            "fields" : { 
              "keyword" : { 
                "type" : "keyword", 
                "ignore_above" : 256 
              } 
            } 
          } 
This allows you to perform a full-text search on the original content field and to sort and run aggregations on the content.keyword sub-field.


Multi-fields are enabled by default for string-based fields and can cause extra overhead if a user is relying on dynamic mapping generation.

However, if you want to create an explicit mapping for a string field meant for full-text search, it should be created as shown in the following example:

            "content" : { 
              "type" : "text" 
            } 

Similarly, a not_analyzed string field needs to be created using the following mapping:

            "content" : { 
              "type" : "keyword" 
            } 


On all field data types (except for the deprecated string field), the index property now only accepts true/false instead of not_analyzed/no.
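The rules above can be captured in a small migration helper. This is an illustrative Python sketch written for this book (the function name is an assumption, not an official tool): analyzed strings become text with a keyword sub-field, matching the 5.x dynamic-mapping default, and not_analyzed strings become keyword.

```python
def migrate_string_mapping(field_mapping):
    """Convert a legacy `string` field mapping to its 5.x equivalent.

    Sketch of the rules described above: `not_analyzed` strings map to
    `keyword`; analyzed strings map to `text` with a `keyword` sub-field.
    """
    if field_mapping.get("type") != "string":
        return field_mapping  # non-string fields are left untouched
    if field_mapping.get("index") == "not_analyzed":
        return {"type": "keyword"}
    return {
        "type": "text",
        "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
    }

# A legacy analyzed string becomes a text field with a keyword sub-field
analyzed = migrate_string_mapping({"type": "string"})

# A legacy not_analyzed string becomes a plain keyword field
exact = migrate_string_mapping({"type": "string", "index": "not_analyzed"})
```

Running such a helper over an exported 1.x/2.x mapping makes it easy to prepare index templates before reindexing into 5.x.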

Floats are default

Earlier, the default data type for decimal fields was double, but it has now been changed to float.

Changes in numeric fields

Numeric fields are now indexed with a completely different data structure, called the BKD tree, which is expected to require less disk space and to be faster for range queries.

Changes in geo_point fields

Similar to numeric fields, the geo_point field now also uses the new BKD tree structure, and the following field parameters for geo_point fields are no longer supported: geohash, geohash_prefix, geohash_precision, and lat_lon. Geohashes are still supported from an API perspective and can still be accessed using the .geohash field extension, but they are no longer used to index geo point data.

For example, in previous versions of Elasticsearch, the mapping of a geo_point field could look like the following:

      "type": "geo_point", 
      "lat_lon": true, 
      "geohash": true, 
      "geohash_prefix": true, 
      "geohash_precision": "1m" 

But, starting from Elasticsearch version 5.0, you can only create the mapping of a geo_point field as shown in the following:

      "type": "geo_point" 
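When upgrading, the deprecated parameters simply have to be dropped from existing mappings. The following Python sketch (an assumed helper name, not an official tool) shows the idea:

```python
# geo_point parameters removed in 5.x, per the list above
DEPRECATED_GEO_PARAMS = {"geohash", "geohash_prefix", "geohash_precision", "lat_lon"}

def clean_geo_point_mapping(field_mapping):
    """Drop geo_point parameters that Elasticsearch 5.x no longer accepts."""
    if field_mapping.get("type") != "geo_point":
        return field_mapping  # only geo_point fields need cleaning
    return {k: v for k, v in field_mapping.items()
            if k not in DEPRECATED_GEO_PARAMS}

legacy_geo = {
    "type": "geo_point",
    "lat_lon": True,
    "geohash": True,
    "geohash_prefix": True,
    "geohash_precision": "1m",
}
cleaned = clean_geo_point_mapping(legacy_geo)  # → {'type': 'geo_point'}
```

Applying this before creating the 5.x index avoids mapping-parsing errors at index-creation time.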

Some more changes

The following are some very important additional changes you should be aware about:

  • Removal of site plugins: support for site plugins has been completely removed in Elasticsearch 5.0.

  • Node clients have been completely removed from Elasticsearch, as they were considered bad from a security perspective.

  • Every Elasticsearch node, by default, binds to localhost, and if you change the bind address to a non-localhost IP address, Elasticsearch considers the node production-ready and applies various bootstrap checks when the node starts. This is done to prevent your cluster from blowing up later if you forget to allocate enough resources to Elasticsearch. The following are some of the bootstrap checks Elasticsearch applies: the maximum number of file descriptors check, the maximum map count check, and the heap size check. Please consult the bootstrap checks section of the Elasticsearch reference documentation to ensure that you have set all the parameters needed for these checks to pass.


    Please note that if you are using OpenVZ virtualization on your servers, you may find it difficult to set the maximum map count for running Elasticsearch in production mode, as this virtualization does not easily allow you to edit kernel parameters. You should either ask your sysadmin to configure vm.max_map_count correctly, or move to a platform where you can set it, for example, a KVM-based VPS.

  • The _optimize endpoint, which was deprecated in 2.x, has finally been removed and replaced by the Force Merge API. For example, an optimize request in version 1.x...

curl -XPOST 'http://localhost:9200/test/_optimize?max_num_segments=5'

...should be converted to:

     curl -XPOST 'http://localhost:9200/test/_forcemerge?max_num_segments=5'
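If you have many scripts or cron jobs calling the old endpoint, a one-line rewrite helper can automate the migration. This Python sketch (the function name is an assumption) preserves query parameters such as max_num_segments:

```python
def rewrite_optimize_url(url):
    """Rewrite a legacy _optimize URL to the 5.x _forcemerge endpoint.

    Illustrative helper for migrating scripts; the query string
    (e.g. max_num_segments) is kept unchanged.
    """
    return url.replace("/_optimize", "/_forcemerge")

old_url = "http://localhost:9200/test/_optimize?max_num_segments=5"
new_url = rewrite_optimize_url(old_url)
```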

In addition to these changes, some major changes have been done in search, settings, allocation, merge, and scripting modules, along with cat and Java APIs, which we will cover in subsequent chapters.



Summary

In this chapter, we gave an overview of Lucene, discussing how it works, how the analysis process is done, and how to use the Apache Lucene query language. In addition to that, we discussed the basic concepts of Elasticsearch.

We also introduced Elasticsearch 5.x and covered the latest features introduced in version 2.x as well as 5.x. Finally, we talked about the most important changes and removal of features that Elasticsearch has implemented during the transition from 1.x to 5.x.

In the next chapter, you will learn about the new default scoring algorithm, BM25, and how it is better than the previous TF-IDF algorithm. In addition to that, we will discuss various Elasticsearch features, such as query rewriting, query templates, changes in query modules, and various queries to choose from, in a given scenario.

About the Author

  • Bharvi Dixit

    Bharvi Dixit is an IT professional with extensive experience of working on search servers, NoSQL databases, and cloud services. He holds a master's degree in computer science and is currently working with Sentieo, a USA-based financial data and equity research platform, where he leads the overall platform and architecture of the company spanning across hundreds of servers. At Sentieo, he also plays a key role in the search and data team.

    He is also the organizer of Delhi's Elasticsearch Meetup Group, where he speaks about Elasticsearch and Lucene and is continuously building the community around these technologies.

    Bharvi also works as a freelance Elasticsearch consultant and has helped more than half a dozen organizations adapt Elasticsearch to solve their complex search problems around different use cases, such as creating search solutions for big data-automated intelligence platforms in the area of counter-terrorism and risk management, as well as in other domains, such as recruitment, e-commerce, finance, social search, and log monitoring.

    He has a keen interest in creating scalable backend platforms. His other areas of interests are search engineering, data analytics, and distributed computing. Java and Python are the primary languages in which he loves to write code. He has also built a proprietary software for consultancy firms.

    In 2013, he started working on Lucene and Elasticsearch, and in 2016, he authored his first book, Elasticsearch Essentials, which was published by Packt. He has also worked as a technical reviewer for the book Learning Kibana 5.0 by Packt.

    You can connect with him on LinkedIn or follow him on Twitter @d_bharvi.

