
Learning Elasticsearch

By Abhishek Andhavarapu
About this book
Elasticsearch is a modern, fast, distributed, scalable, fault tolerant, and open source search and analytics engine. You can use Elasticsearch for small or large applications with billions of documents. It is built to scale horizontally and can handle both structured and unstructured data. Packed with easy-to-follow examples, this book will give you a firm understanding of the basics of Elasticsearch and show you how to use its capabilities efficiently. You will install and set up Elasticsearch and Kibana, and handle documents using the Distributed Document Store. You will see how to query, search, and index your data, and perform aggregation-based analytics with ease. You will see how to use Kibana to explore and visualize your data. Further on, you will learn to handle document relationships, work with geospatial data, and much more. Finally, you will see how you can set up and scale your Elasticsearch clusters in production environments.
Publication date:
June 2017
Publisher
Packt
Pages
404
ISBN
9781787128453

 

Introduction to Elasticsearch

In this chapter, we will focus on the basic concepts of Elasticsearch. We will start by explaining the building blocks and then discuss how to create, modify, and query data in Elasticsearch. Getting started with Elasticsearch is very easy; most operations come with default settings. The default settings can be overridden when you need more advanced features.

I first started using Elasticsearch in 2012 as a backend search engine to power our Analytics dashboards. It has been more than five years, and I have never looked for any other technology for our search needs. Elasticsearch is much more than just a search engine; it supports complex aggregations, geo filters, and the list goes on. Best of all, you can run all your queries at a speed you have never seen before. To understand how this magic happens, we will briefly discuss how Elasticsearch works internally and then discuss how to talk to Elasticsearch. Knowing how it works internally will help you understand its strengths and limitations. Elasticsearch, like any other open source technology, is evolving rapidly, but the core fundamentals that power Elasticsearch don't change. By the end of this chapter, we will have covered the following:

  • Basic concepts of Elasticsearch
  • How to interact with Elasticsearch
  • How to create, read, update, and delete
  • How does search work
  • Availability and horizontal scalability
  • Failure handling
  • Strengths and limitations
 

Basic concepts of Elasticsearch

Elasticsearch is a highly scalable open source search engine. Although it started as a text search engine, it is evolving into an analytical engine, which can support not only search but also complex aggregations. Its distributed nature and ease of use make it very easy to get started and scale as you have more data.

One might ask what makes Elasticsearch different from other document stores out there. Elasticsearch is a search engine and not just a key-value store. It's also a very powerful analytical engine; all the queries that you would usually run in a batch or offline mode can be executed in real time. Support for features such as autocomplete, geo-location based filters, and multilevel aggregations, coupled with its user friendliness, resulted in industry-wide acceptance. That being said, I always believe it is important to have the right tool for the right job. Towards the end of the chapter, we will discuss its strengths and limitations.

In this section, we will go through the basic concepts and terminology of Elasticsearch. We will start by explaining how to insert, update, and perform a search. If you are familiar with SQL language, the following table shows the equivalent terms in Elasticsearch:

SQL            Database  Table  Row       Column
Elasticsearch  Index     Type   Document  Field

Document

Your data in Elasticsearch is stored as JSON (JavaScript Object Notation) documents. Most NoSQL data stores use JSON to store their data, as the JSON format is very concise, flexible, and readily understood by humans. A document in Elasticsearch is very similar to a row in a relational database. Let's say we have a User table with the following information:

Id Name Age Gender Email
1 Luke 100 M luke@gmail.com
2 Leia 100 F leia@gmail.com

The users in the preceding user table, when represented in JSON format, will look like the following:

{
  "id": 1,
  "name": "Luke",
  "age": 100,
  "gender": "M",
  "email": "luke@gmail.com"
}
{
  "id": 2,
  "name": "Leia",
  "age": 100,
  "gender": "F",
  "email": "leia@gmail.com"
}

A row contains columns; similarly, a document contains fields. Elasticsearch documents are very flexible and support storing nested objects. For example, an existing user document can be easily extended to include the address information. To capture similar information using a table structure, you need to create a new address table and manage the relations using a foreign key. The user document with the address is shown here:

{
"id": 1,
"name": "Luke",
"age": 100,
"gender": "M",
"email": "luke@gmail.com",
"address": {
"street": "123 High Lane",
"city": "Big City",
"state": "Small State",
"zip": 12345
}
}

Reading similar information without the JSON structure would also be difficult as the information would have to be read from multiple tables. Elasticsearch allows you to store the entire JSON as it is. For a database table, the schema has to be defined before you can use the table. Elasticsearch is built to handle unstructured data and can automatically determine the data types for the fields in the document. You can index new documents or add new fields without adding or changing the schema. This process is also known as dynamic mapping. We will discuss how this works and how to define schema in Chapter 3, Modeling Your Data and Document Relations.
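
To make the contrast with the two-table layout concrete, here is a minimal Python sketch that folds a user row and an address row into the single nested document shown above. The `user_id` foreign key and the `to_document` helper are assumptions made for illustration; they are not part of Elasticsearch.

```python
import json

# Hypothetical relational rows: a user row and an address row linked by user_id.
user_row = {"id": 1, "name": "Luke", "age": 100, "gender": "M",
            "email": "luke@gmail.com"}
address_row = {"user_id": 1, "street": "123 High Lane", "city": "Big City",
               "state": "Small State", "zip": 12345}


def to_document(user, address):
    """Embed the address row into the user row as a nested object."""
    doc = dict(user)
    # Drop the foreign key; nesting makes the relation implicit.
    doc["address"] = {k: v for k, v in address.items() if k != "user_id"}
    return doc


document = to_document(user_row, address_row)
print(json.dumps(document, indent=2))
```

The resulting JSON can be indexed as it is, so reading the user and the address back is a single document fetch rather than a join.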

Index

An index is similar to a database. The term index should not be confused with a database index, as someone familiar with traditional SQL might assume. Your data is stored in one or more indexes just like you would store it in one or more databases. The word indexing means inserting or updating documents in an Elasticsearch index. The name of the index must be unique and typed in all lowercase letters. For example, in an e-commerce world, you would have one index for the items, one for orders, one for customer information, and so on.

Type

A type is similar to a database table. An index can have one or more types. A type is a logical separation of different kinds of data. For example, if you are building a blog application, you would have a type defined for articles in the blog and a type defined for comments in the blog. Let's say we have two types: articles and comments.

The following is the document that belongs to the article type:

{
"articleid": 1,
"name": "Introduction to Elasticsearch"
}

The following is the document that belongs to the comment type:

{
"commentid": "AVmKvtPwWuEuqke_aRsm",
"articleid": 1,
"comment": "Its Awesome !!"
}

We can also define relations between different types. For example, a parent/child relation can be defined between articles and comments. An article (parent) can have one or more comments (children). We will discuss relations further in Chapter 3, Modeling Your Data and Document Relations.

Cluster and node

In a traditional database system, we usually have only one server serving all the requests. Elasticsearch is a distributed system, meaning it is made up of one or more nodes (servers) that act as a single application, which enables it to scale and handle load beyond what a single server can handle. Each node (server) has part of the data. You can start running Elasticsearch with just one node and add more nodes, or, in other words, scale the cluster when you have more data. A cluster with three nodes is shown in the following diagram:

In the preceding diagram, the cluster has three nodes with the names elasticsearch1, elasticsearch2, and elasticsearch3. These three nodes work together to handle all the indexing and query requests on the data. Each cluster is identified by a unique name, which defaults to elasticsearch. It is common to have multiple clusters, one for each environment, such as staging, pre-production, and production.

Just like a cluster, each node is identified by a unique name. Elasticsearch will automatically assign a unique name to each node if the name is not specified in the configuration. Depending on your application needs, you can add and remove nodes (servers) on the fly. Adding and removing nodes is seamlessly handled by Elasticsearch.

We will discuss how to set up an Elasticsearch cluster in Chapter 2, Setting Up Elasticsearch and Kibana.

Shard

An index is a collection of one or more shards. All the data that belongs to an index is distributed across multiple shards. By spreading the data that belongs to an index to multiple shards, Elasticsearch can store information beyond what a single server can store. Elasticsearch uses Apache Lucene internally to index and query the data. A shard is nothing but an Apache Lucene instance. We will discuss Apache Lucene and why Elasticsearch uses Lucene in the How search works section later.

I know we introduced a lot of new terms in this section. For now, just remember that all data that belongs to an index is spread across one or more shards. We will discuss how shards work in the Scalability and Availability section towards the end of this chapter.

 

Interacting with Elasticsearch

The primary way of interacting with Elasticsearch is via the REST API. Elasticsearch provides a JSON-based REST API over HTTP. By default, the Elasticsearch REST API runs on port 9200. Anything from creating an index to shutting down a node is a simple REST call. The APIs are broadly classified into the following:

  • Document APIs: CRUD (Create, Retrieve, Update, Delete) operations on documents
  • Search APIs: For all the search operations
  • Indices APIs: For managing indices (creating an index, deleting an index, and so on)
  • Cat APIs: Instead of JSON, the data is returned in tabular form
  • Cluster APIs: For managing the cluster

We have a chapter dedicated to each one of them to discuss them in more detail. For example, indexing documents is covered in Chapter 4, Indexing and Updating Your Data, and search in Chapter 6, All About Search. In this section, we will go through some basic CRUD operations using the Document APIs. This section is simply a brief introduction to manipulating data using the Document APIs. To use Elasticsearch in your application, clients in all major languages, such as Java and Python, are also provided. The majority of the clients act as a wrapper around the REST API.

To better explain the CRUD operations, imagine we are building an e-commerce site and we want to use Elasticsearch to power its search functionality. We will use an index named chapter1 and store all the products in the type called product. Each product we want to index is represented by a JSON document. We will start by creating a new product document, and then we will retrieve a product by its identifier, followed by updating a product's category and deleting a product using its identifier.

Creating a document

A new document can be added using the Document APIs. For the e-commerce example, to add a new product, we execute the following command. The body of the request is the product document we want to index.

PUT http://localhost:9200/chapter1/product/1
{
"title": "Learning Elasticsearch",
"author": "Abhishek Andhavarapu",
"category": "books"
}

Let's inspect the request:

INDEX chapter1
TYPE product
IDENTIFIER 1
DOCUMENT JSON
HTTP METHOD PUT

The document's properties, such as title, author, and category, are also known as fields, which are similar to SQL columns.

Elasticsearch will automatically create the index chapter1 and type product if they don't exist already. It will create the index with the default settings.

When we execute the preceding request, Elasticsearch responds with a JSON response, shown as follows:

{
"_index": "chapter1",
"_type": "product",
"_id": "1",
"_version": 1,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"created": true
}

In the response, you can see that Elasticsearch created the document and the version of the document is 1. Since you are creating the document using the HTTP PUT method, you are required to specify the document identifier. If you don’t specify the identifier, Elasticsearch will respond with the following error message:

No handler found for uri [/chapter1/product/] and method [PUT]

If you don’t have a unique identifier, you can let Elasticsearch assign an identifier for you, but you should use the POST HTTP method. For example, if you are indexing log messages, you will not have a unique identifier for each log message, and you can let Elasticsearch assign the identifier for you.

In general, we use the HTTP POST method for creating an object. The HTTP PUT method can also be used for object creation, where the client provides the unique identifier instead of the server assigning the identifier.

We can index a document without specifying a unique identifier as shown here:

POST http://localhost:9200/chapter1/product/
{
"title": "Learning Elasticsearch",
"author": "Abhishek Andhavarapu",
"category": "books"
}

In the above request, the URL doesn't contain the unique identifier and we are using the HTTP POST method. Let's inspect the request:

INDEX chapter1
TYPE product
DOCUMENT JSON
HTTP METHOD POST

The response from Elasticsearch is shown as follows:

{
"_index": "chapter1",
"_type": "product",
"_id": "AVmKvtPwWuEuqke_aRsm",
"_version": 1,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"created": true
}

You can see from the response that Elasticsearch assigned the unique identifier AVmKvtPwWuEuqke_aRsm to the document and the created flag is set to true. If a document with the same unique identifier already exists, Elasticsearch replaces the existing document and increments the document version. If you were to run the same PUT request from the beginning of the section again, the response from Elasticsearch would be this:

{
"_index": "chapter1",
"_type": "product",
"_id": "1",
"_version": 2,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"created": false
}

In the response, you can see that the created flag is false since the document with id: 1 already exists. Also, observe that the version is now 2.
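
The choice between PUT with an identifier and POST without one can be sketched in a few lines of Python using only the standard library. The helper name `build_index_request` is hypothetical, and the code only builds the request objects; it does not contact a server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # default REST port used in this chapter


def build_index_request(index, doc_type, document, doc_id=None):
    """Build (but do not send) the HTTP request that indexes a document."""
    if doc_id is None:
        # No identifier: POST and let Elasticsearch assign one.
        method, url = "POST", f"{BASE_URL}/{index}/{doc_type}/"
    else:
        # Client-supplied identifier: PUT to the full document URL.
        method, url = "PUT", f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
    return urllib.request.Request(
        url,
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


product = {"title": "Learning Elasticsearch",
           "author": "Abhishek Andhavarapu",
           "category": "books"}

with_id = build_index_request("chapter1", "product", product, doc_id=1)
without_id = build_index_request("chapter1", "product", product)
print(with_id.get_method(), with_id.full_url)
print(without_id.get_method(), without_id.full_url)
```

Against a running node, either request could be sent with `urllib.request.urlopen`; that step is left out here so the sketch runs without a cluster.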

Retrieving an existing document

To retrieve an existing document, we need the index, the type, and the unique identifier of the document. Let’s try to retrieve the document we just indexed. To retrieve a document, we need to use the HTTP GET method as shown below:

GET http://localhost:9200/chapter1/product/1

Let’s inspect the request:

INDEX chapter1
TYPE product
IDENTIFIER 1
HTTP METHOD GET

The response from Elasticsearch, shown below, contains the product document we indexed in the previous section:

{
"_index": "chapter1",
"_type": "product",
"_id": "1",
"_version": 2,
"found": true,
"_source": {
"title": "Learning Elasticsearch",
"author": "Abhishek Andhavarapu",
"category": "books"
}
}

The actual JSON document will be stored in the _source field. Also note the version in the response; every time the document is updated, the version is increased.

Updating an existing document

Updating a document in Elasticsearch is more complicated than in a traditional SQL database. Internally, Elasticsearch retrieves the old document, applies the changes, and re-inserts the document as a new document. The update operation is very expensive. There are different ways of updating a document. We will talk about updating a partial document here and in more detail in the Updating your data section in Chapter 4, Indexing and Updating Your Data.

Updating a partial document

We already indexed the document with the unique identifier 1, and now we need to update the category of the product from just books to technical books. We can update the document as shown here:

POST http://localhost:9200/chapter1/product/1/_update
{
"doc": {
"category": "technical books"
}
}

The body of the request is the field of the document we want to update and the unique identifier is passed in the URL.

Please note the _update endpoint at the end of the URL.

The response from Elasticsearch is shown here:

{
"_index": "chapter1",
"_type": "product",
"_id": "1",
"_version": 3,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
}
}

As you can see in the response, the operation is successful, and the version of the document is now 3. More complicated update operations are possible using scripts and upserts.

Deleting an existing document

For creating and retrieving a document, we used the PUT, POST, and GET methods. For deleting an existing document, we need to use the HTTP DELETE method and pass the unique identifier of the document in the URL as shown here:

DELETE http://localhost:9200/chapter1/product/1

Let's inspect the request:

INDEX chapter1
TYPE product
IDENTIFIER 1
HTTP METHOD DELETE

The response from Elasticsearch is shown here:

{
"found": true,
"_index": "chapter1",
"_type": "product",
"_id": "1",
"_version": 4,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
}
}

In the response, you can see that Elasticsearch was able to find the document with the unique identifier 1 and was successful in deleting the document.
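
The version numbers seen across the create, update, and delete responses can be reproduced with a small in-memory model. This is a toy sketch for building intuition, not how Elasticsearch is implemented; the class `ToyDocumentStore` and its response dictionaries are assumptions that only mirror a few fields of the real responses.

```python
class ToyDocumentStore:
    """In-memory stand-in that mimics the versioning seen in the responses."""

    def __init__(self):
        self._docs = {}      # id -> current _source
        self._versions = {}  # id -> last version, kept even after a delete

    def _next_version(self, doc_id):
        self._versions[doc_id] = self._versions.get(doc_id, 0) + 1
        return self._versions[doc_id]

    def index(self, doc_id, source):
        # Indexing an existing id replaces the document and bumps the version.
        created = doc_id not in self._docs
        self._docs[doc_id] = dict(source)
        return {"_id": doc_id, "_version": self._next_version(doc_id),
                "created": created}

    def get(self, doc_id):
        found = doc_id in self._docs
        result = {"_id": doc_id, "found": found}
        if found:
            result["_version"] = self._versions[doc_id]
            result["_source"] = dict(self._docs[doc_id])
        return result

    def update(self, doc_id, partial):
        # Fetch the old document, merge the changes, re-index the result.
        merged = {**self._docs[doc_id], **partial}
        return self.index(doc_id, merged)

    def delete(self, doc_id):
        found = self._docs.pop(doc_id, None) is not None
        version = self._next_version(doc_id) if found else None
        return {"_id": doc_id, "found": found, "_version": version}


store = ToyDocumentStore()
product = {"title": "Learning Elasticsearch", "category": "books"}
first = store.index("1", product)    # new document
second = store.index("1", product)   # same id again: replaced
updated = store.update("1", {"category": "technical books"})
deleted = store.delete("1")
print(first["_version"], second["_version"],
      updated["_version"], deleted["_version"])
```

The update path makes the cost of updates visible: even a one-field change reads the whole document and writes a whole new one.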

 

How does search work?

In the previous section, we discussed how to create, update, and delete documents. In this section, we will briefly discuss how search works internally and explain the basic query APIs. Mostly, I want to talk about the inverted index and Apache Lucene. All the data in Elasticsearch is internally stored in Apache Lucene as an inverted index. Although data is stored in Apache Lucene, Elasticsearch is what makes it distributed and provides the easy-to-use APIs. We will discuss the Search API in detail in Chapter 6, All About Search.

Importance of information retrieval

As computation power increases and the cost of storage decreases, the amount of day-to-day data we deal with is growing exponentially. But without a way to retrieve the information and query it, the information we collect doesn't help.

Information retrieval systems are very important to make sense of the data. Imagine how hard it would be to find some information on the Internet without Google or other search engines out there. Information is not knowledge without information retrieval systems.

Simple search query

Let's say we have a User table as shown here:

Id Name Age Gender Email
1 Luke 100 M luke@gmail.com
2 Leia 100 F leia@gmail.com

Now, we want to query for all the users with the name Luke. A SQL query to achieve this would be something like this:

select * from user where name like '%luke%'

To do a similar task in Elasticsearch, you can use the search API and execute the following command:

GET http://127.0.0.1:9200/chapter1/user/_search?q=name:luke

Let's inspect the request:

INDEX chapter1
TYPE user
FIELD name

Just as the SQL query returns the matching rows from the User table, the Elasticsearch query returns the matching JSON documents. The matching document is shown here without the metadata that surrounds it in the full search response:

{
"id": 1,
"name": "Luke",
"age": 100,
"gender": "M",
"email": "luke@gmail.com"
}

Querying using the URL parameters can be used for simple queries as shown above. For more practical queries, you should pass the query represented as JSON in the request body. The same query passed in the request body is shown here:

POST http://127.0.0.1:9200/chapter1/user/_search 
{
"query": {
"term": {
"name": "luke"
}
}
}

The Search API is very flexible and supports different kinds of filters, sort, pagination, and aggregations.
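
The two ways of passing the query shown above, as a URL parameter and as a JSON request body, can be put side by side in a short Python sketch. The helper names `uri_search` and `body_search` are hypothetical; the code only builds the URL and body and sends nothing.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://127.0.0.1:9200"


def uri_search(index, doc_type, field, value):
    """Return the URL for a simple query passed as a URL parameter."""
    # urlencode percent-encodes the colon between field and value.
    query_string = urlencode({"q": f"{field}:{value}"})
    return f"{BASE_URL}/{index}/{doc_type}/_search?{query_string}"


def body_search(index, doc_type, field, value):
    """Return (url, body) for the query expressed as a JSON term query."""
    url = f"{BASE_URL}/{index}/{doc_type}/_search"
    body = json.dumps({"query": {"term": {field: value}}})
    return url, body


simple_url = uri_search("chapter1", "user", "name", "luke")
search_url, search_body = body_search("chapter1", "user", "name", "luke")
print(simple_url)
print(search_url, search_body)
```

The body form scales better as queries grow, since filters, sorting, and aggregations become extra keys in the same JSON object rather than an ever longer query string.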

Inverted index

Before we talk more about search, I want to talk about the inverted index. Knowing about the inverted index will help you understand the limitations and strengths of Elasticsearch compared with the traditional database systems out there. The inverted index is at the core of how Elasticsearch differs from other NoSQL stores, such as MongoDB and Cassandra.

We can compare an inverted index to an old library catalog card system. When you need some information/book in a library, you will use the card catalog, usually at the entrance of the library, to find the book. An inverted index is similar to the card catalog. Imagine that you were to build a system like Google to search for the web pages mentioning your search keywords. We have three web pages with Yoda quotes from Star Wars, and you are searching for all the documents with the word fear.

Document1: Fear leads to anger

Document2: Anger leads to hate

Document3: Hate leads to suffering

In a library, without a card catalog to find the book you need, you would have to go to every shelf row by row, look at each book title, and see whether it's the book you need. Computer-based information retrieval systems do the same.

Without the inverted index, the application has to go through each web page and check whether the word exists in the web page. An inverted index is similar to the following table. It is like a map with the term as a key and list of the documents the term appears in as value.

Term Document
Fear 1
Anger 1,2
Hate 2,3
Suffering 3
Leads 1,2,3

Once we construct an index, as shown in this table, finding all the documents with the term fear is just a lookup. Just as a new book is added to the card catalog when a library receives it, we keep building the inverted index as we encounter new web pages. The preceding inverted index takes care of simple use cases, such as searching for a single term. But in reality, we query for much more complicated things, and we don’t use the exact words. Now let’s say we encountered a document containing the following:

Yosemite national park may be closed for the weekend due to forecast of substantial rainfall

We want to visit Yosemite National Park, and we are looking for the weather forecast in the park. But when we query for it in human language, we might query something like weather in yosemite or rain in yosemite. With the current approach, the query term rain will not match, because the document contains rainfall rather than rain, as shown:

Document Query
rainfall rain

To be able to answer queries like this and to improve the search quality, we employ various techniques, such as stemming and synonyms, discussed in the following sections.
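
The term-to-documents map for the three Yoda quotes above can be built in a few lines of Python. This is a minimal sketch of the idea, not Lucene's data structure; treating "to" as a stop word is an assumption made so that the result matches the table, which omits it.

```python
from collections import defaultdict

documents = {
    1: "Fear leads to anger",
    2: "Anger leads to hate",
    3: "Hate leads to suffering",
}
STOP_WORDS = {"to"}  # assumption: the table above leaves out "to"


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        terms = set(docs[doc_id].lower().split()) - STOP_WORDS
        for term in terms:
            index[term].append(doc_id)
    return dict(index)


index = build_inverted_index(documents)
print(index["fear"])         # documents containing "fear"
print(index.get("rain", []))  # a term that was never indexed
```

Looking up a term is a dictionary access, and a term that was never indexed, such as rain here, simply returns nothing; that is the gap stemming and synonyms try to close.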

Stemming

Stemming is the process of reducing a derived word to its root word. For example, rain, raining, rained, and rainfall have the common root word "rain". When a document is indexed, the root word is stored in the index instead of the actual word. Without stemming, we end up storing rain, raining, and rained in the index, and search relevance would be very low. The query terms also go through the stemming process, and the root words are looked up in the index. Stemming increases the likelihood of users finding what they are looking for. When we query for rain in yosemite, even though the document originally had rainfall, the inverted index will contain the term rain.

We can configure stemming in Elasticsearch using Analyzers. We will discuss how to set up and configure analyzers in Chapter 3, Modeling Your Data and Document Relations.
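
To illustrate the idea, here is a deliberately naive suffix-stripping stemmer in Python. It is an assumption-laden toy, not the algorithm used by Elasticsearch analyzers; the rule set and the minimum stem length of three characters are choices made only for this sketch.

```python
SUFFIXES = ("ing", "ed", "s")  # assumption: a deliberately tiny rule set


def naive_stem(word):
    """Strip one known suffix if at least three characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["rain", "raining", "rained", "rainfall"]:
    print(word, "->", naive_stem(word))
```

Note that plain suffix stripping leaves a compound such as rainfall unchanged; reducing it to rain would need an extra rule or a dictionary, which this sketch does not attempt. The same function must be applied to both the indexed text and the query terms for the lookup to line up.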

Synonyms

Similar to rain and raining, weekend and Sunday mean nearly the same thing in a query. The document might not contain Sunday, but if the information retrieval system can also search for synonyms, it will significantly improve the search quality. Human language deals with a lot of things, such as tense, gender, and number. Stemming and synonyms will not only improve the search quality but also reduce the index size by removing the differences between similar words.

More examples:

Pen, Pens -> Pen

Eat, Eating -> Eat
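
Synonym handling can be sketched as a normalization step applied to both documents and queries. The synonym map below is hypothetical and hand-written for this example; real systems configure such lists rather than hard-coding them.

```python
# Hypothetical synonym map: each key is rewritten to its canonical term.
SYNONYMS = {"sunday": "weekend", "saturday": "weekend"}


def normalize(text):
    """Lowercase, split on whitespace, and replace known synonyms."""
    return [SYNONYMS.get(token, token) for token in text.lower().split()]


document_terms = set(normalize("park closed for the weekend"))
query_terms = set(normalize("park closed Sunday"))
print(sorted(document_terms & query_terms))
```

Because Sunday is rewritten to weekend on the query side, the query and the document now share that term even though the document never mentions Sunday.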

Phrase search

As users, we almost always search for phrases rather than single words. The inverted index in the previous section would work great for individual terms but not for phrases. Continuing the previous example, if we want to query all the documents containing the phrase anger leads to, the previous index would not be sufficient. The inverted index for the terms anger and leads is shown below:

Term Document
Anger 1,2
Leads 1,2,3

From the preceding table, the words anger and leads exist in both document1 and document2. To support phrase search, along with the document we also need to record the position of the word in the document. The inverted index with word positions is shown here, with each entry written as document:position:

Term Document:Position
Fear 1:1
Anger 1:3, 2:1
Hate 2:3, 3:1
Suffering 3:3
Leads 1:2, 2:2, 3:2

Now, since we have the information regarding the position of the word, we can search if a document has the terms in the same order as the query.

Term Document
anger 1:3, 2:1
leads 1:2, 2:2

Since document2 has anger as the first word and leads as the second word, the same order as the query, document2 would be a better match than document1. With the inverted index, any query on the documents is just a simple lookup. This is just an introduction to inverted index; in real life, it's much more complicated, but the fundamentals remain the same. When the documents are indexed into Elasticsearch, documents are processed into the inverted index.
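
The positional index and the phrase check described above can be sketched in Python as follows. As in the tables, "to" is dropped before positions are assigned, which is an assumption of this sketch; the function names are hypothetical and the approach is far simpler than what Lucene does.

```python
from collections import defaultdict

documents = {
    1: "Fear leads to anger",
    2: "Anger leads to hate",
    3: "Hate leads to suffering",
}
STOP_WORDS = {"to"}  # assumption: dropped before positions are counted


def tokenize(text):
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}, with positions starting at 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


def phrase_search(index, phrase):
    """Return documents where the phrase terms occur at consecutive positions."""
    terms = tokenize(phrase)
    if not terms:
        return []
    matches = []
    for doc_id, starts in index.get(terms[0], {}).items():
        for start in starts:
            if all(start + offset in index.get(term, {}).get(doc_id, [])
                   for offset, term in enumerate(terms)):
                matches.append(doc_id)
                break
    return sorted(matches)


index = build_positional_index(documents)
print(phrase_search(index, "anger leads to"))
```

Only document 2 has anger immediately followed by leads, so it is the only phrase match, even though document 1 contains both words.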

Apache Lucene

Apache Lucene is one of the most mature implementations of the inverted index. Lucene is an open source full-text search library written entirely in Java and known for its high performance. Any application that requires text search can use Lucene to add full-text search capabilities. Elasticsearch uses Apache Lucene to create and manage its inverted index. To learn more about Apache Lucene, please visit http://lucene.apache.org/core/.

We will talk about how distributed search works in Elasticsearch in the next section.

The term index is used both for an Apache Lucene inverted index and for an Elasticsearch index. For the remainder of the book, unless specified otherwise, the term index refers to an Elasticsearch index.
 

Scalability and availability

Let's say you want to index a billion documents; doing so on a single machine might be very challenging. Partitioning data across multiple machines allows Elasticsearch to scale beyond what a single machine can do and to support high-throughput operations. Your data is split into small parts called shards. When you create an index, you need to tell Elasticsearch the number of shards you want for the index, and Elasticsearch handles the rest for you. As your data grows, you can scale horizontally by adding more machines. We will go into more detail in the sections below.

There are two types of shards in Elasticsearch: primary and replica. The data you index is written to both primary and replica shards. A replica is an exact copy of the primary. If the node containing the primary shard goes down, the replica takes over. This process is completely transparent and managed by Elasticsearch. We will discuss this in detail in the Failure handling section below. Since a primary and its replicas are exact copies, a search query can be answered by either the primary or the replica shard. This significantly increases the number of simultaneous requests Elasticsearch can handle at any point in time.

As the index is distributed across multiple shards, a query against an index is executed in parallel across all the shards. The results from each shard are then gathered and sent back to the client. Executing the query in parallel greatly improves the search performance.

In the next section, we will discuss the relation between node, index and shard.

Relation between node, index, and shard

The shard is often the most confusing topic when I talk about Elasticsearch at conferences or to someone who has never worked with Elasticsearch. In this section, I want to focus on the relation between node, index, and shard. We will use a cluster with three nodes, create the same index with different shard configurations, and talk through the differences.

Three shards with zero replicas

We will start with an index called esintroduction with three shards and zero replicas. The distribution of the shards in a three node cluster is as follows:

In the above screenshot, shards are represented by the green squares. We will talk about replicas towards the end of this discussion. Since we have three nodes (servers) and three shards, the shards are evenly distributed across all three nodes, and each node contains one shard. As you index your documents into the esintroduction index, data is spread across the three shards.
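As a rough sketch, the shard and replica counts for such an index are passed as settings when the index is created. The snippet below only builds and prints the request instead of sending it; the helper name is hypothetical, while number_of_shards and number_of_replicas are the index settings Elasticsearch expects.

```python
import json


def create_index_request(index_name, shards, replicas):
    """Return the HTTP method, path, and JSON body of a create-index request."""
    body = {
        "settings": {
            "number_of_shards": shards,      # primary shards, set at creation
            "number_of_replicas": replicas,  # copies of each primary shard
        }
    }
    return "PUT", "/" + index_name, json.dumps(body, indent=2)


method, path, body = create_index_request("esintroduction", shards=3, replicas=0)
print(method, path)
print(body)
```

Sending this body to a running cluster (for example with an HTTP client against the node's REST endpoint) would create the index; the address of that endpoint depends on your setup.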

Six shards with zero replicas

Now, let's recreate the same esintroduction index with six shards and zero replicas. Since we have three nodes (servers) and six shards, each node will now contain two shards. The esintroduction index is split between six shards across three nodes.

The distribution of shards for an index with six shards is as follows:

The esintroduction index is spread across three nodes, meaning these three nodes will handle the index/query requests for the index. If these three nodes are not able to keep up with the indexing/search load, we can scale the esintroduction index by adding more nodes. Since the index has six shards, you could add three more nodes, and Elasticsearch automatically rearranges the shards across all six nodes. Now, index/query requests for the esintroduction index will be handled by six nodes instead of three. If this is not clear, do not worry; we will discuss it further as we progress through the book.

Six shards with one replica

Let's now recreate the same esintroduction index with six shards and one replica, meaning the index will have 6 primary shards and 6 replica shards, a total of 12 shards. Since we have three nodes (servers) and twelve shards, each node will now contain four shards. The esintroduction index is split between six shards across three nodes. The green squares represent shards in the following figure.

The solid border represents primary shards, and replicas are the dotted squares:

As we discussed before, the index is distributed into multiple shards across multiple nodes. In a distributed environment, a node/server can go down for various reasons, such as a disk failure or a network issue. To ensure availability, each shard, by default, is replicated to a node other than the one where the primary shard exists. If the node containing the primary shard goes down, the shard replica is promoted to primary, so no data is lost and you can continue to operate on the index. In the preceding figure, the esintroduction index has six shards split across the three nodes. The primary of shard 2 belongs to node elasticsearch 1, and the replica of shard 2 belongs to node elasticsearch 3. If the elasticsearch 1 node goes down, the replica in elasticsearch 3 is promoted to primary. This switch is completely transparent and handled by Elasticsearch.
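The shard counts per node in the three configurations can be checked with a deliberately simplified allocator. This is not the allocation algorithm Elasticsearch uses, and the exact placement it produces may differ from the figures; it only enforces the rule that a primary and its replica never share a node. The function name and round-robin strategy are assumptions made for this sketch.

```python
def allocate(nodes, primaries, replicas):
    """Spread shard copies round-robin so no node holds two copies of a shard."""
    if replicas >= len(nodes):
        raise ValueError("need more nodes than replicas per shard")
    layout = {node: [] for node in nodes}
    for shard in range(primaries):
        for copy in range(replicas + 1):
            node = nodes[(shard + copy) % len(nodes)]
            kind = "primary" if copy == 0 else "replica"
            layout[node].append((shard, kind))
    return layout


nodes = ["elasticsearch 1", "elasticsearch 2", "elasticsearch 3"]
for primaries, replicas in [(3, 0), (6, 0), (6, 1)]:
    layout = allocate(nodes, primaries, replicas)
    print(primaries, "shards,", replicas, "replicas:",
          {node: len(copies) for node, copies in layout.items()})
```

The three runs give one, two, and four shard copies per node, matching the three configurations discussed in this section.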

Distributed search

One of the reasons queries executed on Elasticsearch are so fast is that they are distributed. Multiple shards act as one index. A search query on an index is executed in parallel across all the shards.

Let's take an example: in the following figure, we have a cluster with two nodes (Node1 and Node2) and an index named chapter1 with two shards (S0 and S1) and one replica:

Assuming the chapter1 index has 100 documents, S0 and S1 would each have about 50 documents. Suppose you want to query for all the documents that contain the word Elasticsearch. The query is executed on S0 and S1 in parallel. The results are gathered back from both the shards and sent back to the client. Now imagine you have to query across millions of documents; with Elasticsearch, the search can be distributed. For the application I'm currently working on, a query on more than 100 million documents comes back within 50 milliseconds, which would simply not be possible if the search were not distributed.
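The scatter-and-gather pattern described above can be sketched with a thread pool standing in for the shards. The shard contents, document ids, and function names are hypothetical, and a real cluster runs each shard query on the node that holds the shard rather than in local threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical contents of the two shards of the chapter1 index.
SHARDS = {
    "S0": {1: "Learning Elasticsearch", 3: "Apache Lucene in action"},
    "S1": {2: "Elasticsearch is distributed", 4: "Inverted index basics"},
}


def search_shard(shard_name, word):
    """Return ids of documents on one shard whose text contains the word."""
    docs = SHARDS[shard_name]
    return [doc_id for doc_id, text in docs.items()
            if word.lower() in text.lower().split()]


def search_index(word):
    """Run the query on every shard in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partial = pool.map(lambda name: search_shard(name, word), SHARDS)
        return sorted(doc_id for hits in partial for doc_id in hits)


print(search_index("Elasticsearch"))
```

Each shard only scans its own documents, so adding shards (and nodes to host them) reduces the work done per shard for the same query.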

Failure handling

Elasticsearch handles failures automatically. This section describes how failures are handled internally. Let's say we have an index with two shards and one replica. In the following diagram, the shards drawn with a solid line are primary shards, and the shards drawn with a dotted line are replicas:

As shown in the preceding diagram, we initially have a cluster with two nodes. Since the index has two shards and one replica, the shards are distributed across the two nodes. To ensure availability, a primary shard and its replica never exist on the same node. If a single node held both the primary and its replica and that node went down, the data could not be recovered. In the preceding diagram, you can see that the primary shard S0 belongs to Node1 and the replica shard S0 belongs to Node2.

Next, just like we discussed in the Relation between Node, Index and Shard section, we will add two new nodes to the existing cluster, as shown here:

The cluster now contains four nodes, and the shards are automatically allocated to the new nodes. Each node in the cluster will now contain either a primary or replica shard. Now, let's say Node2, which contains the primary shard S1, goes down as shown here:

Since the node that holds the primary shard went down, the replica of S1, which lives in Node3, is promoted to primary. To ensure the replication factor of 1, a copy of the shard S1 is made on Node1. This process is known as rebalancing of the cluster.

Depending on the application, the number of shards can be configured while creating the index. The process of rebalancing the shards to other nodes is entirely transparent to the user and handled automatically by Elasticsearch.
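The promotion and re-replication steps from this walkthrough can be sketched as a small simulation. It is a simplified model, not how Elasticsearch implements recovery: the function name is hypothetical, and the placement of the S0 replica on Node4 is an assumption, since the text does not state it.

```python
def handle_node_failure(cluster, failed_node):
    """Promote replicas of lost primaries, then re-create the missing copies."""
    lost = cluster.pop(failed_node)
    for shard, role in lost.items():
        holders = [node for node, shards in cluster.items() if shard in shards]
        if role == "primary":
            if not holders:
                raise RuntimeError("no replica left for " + shard)
            cluster[holders[0]][shard] = "primary"  # promote the replica
        # Restore the replication factor on the least loaded node that
        # does not already hold a copy of this shard.
        candidates = [node for node in sorted(cluster) if shard not in cluster[node]]
        if candidates:
            target = min(candidates, key=lambda node: len(cluster[node]))
            cluster[target][shard] = "replica"
    return cluster


cluster = {
    "Node1": {"S0": "primary"},
    "Node2": {"S1": "primary"},
    "Node3": {"S1": "replica"},
    "Node4": {"S0": "replica"},  # assumed placement; the text does not say
}
handle_node_failure(cluster, "Node2")
for node, shards in cluster.items():
    print(node, shards)
```

After Node2 fails, the S1 replica on Node3 becomes the primary and a new S1 replica is created on Node1, which mirrors the sequence described above.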

Strengths and limitations of Elasticsearch

The strengths of Elasticsearch are as follows:

  • Very flexible Query API:
    • It supports a JSON-based REST API.
    • Clients are available for all major languages, such as Java, Python, PHP, and so on.
    • It supports filtering, sorting, pagination, and aggregations in the same query.
  • Supports auto/dynamic mapping:
    • In the traditional SQL world, you should predefine the table schema before you can add data. Elasticsearch handles unstructured data automatically, meaning you can index JSON documents without predefining the schema. It will try to figure out the field mappings automatically.
    • Adding/removing the new/existing fields is also handled automatically.
  • Highly scalable:
    • Clustering, replication of data, and automatic failover are supported out of the box and are completely transparent to the user. For more details, refer to the Scalability and availability section.
  • Multi-language support:
    • We discussed how stemming works and why it is important to remove the difference between the different forms of root words. This process is completely different for different languages. Elasticsearch supports many languages out of the box.
  • Aggregations:
    • Aggregations are one of the features that set Elasticsearch apart from other search engines.
    • It comes with a very powerful analytics engine, which can help you slice and dice your data.
    • It supports nested aggregations. For example, you can group users first by the city they live in and then by their gender and then calculate the average age of each bucket.
  • Performance:
    • Due to the inverted index and the distributed nature, it is extremely high performing. The queries you traditionally run using a batch processing engine, such as Hadoop, can now be executed in real time.
  • Intelligent filter caching:
    • The most recently used queries are cached. When the data is modified, the cache is invalidated automatically.
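To make the query API and nested aggregation points more concrete, here is a sketch of a single search request body that combines filtering, sorting, pagination, and the city, gender, and average age example from the list. The field names (city, gender, age), the filter value, and the aggregation names are hypothetical; the body is only printed, not sent to a cluster.

```python
import json

# Field names (city, gender, age) and values are hypothetical.
search_body = {
    "query": {"bool": {"filter": [{"range": {"age": {"gte": 18}}}]}},  # filtering
    "sort": [{"age": "desc"}],                                         # sorting
    "from": 0,                                                         # pagination
    "size": 10,
    "aggs": {
        "by_city": {
            "terms": {"field": "city"},
            "aggs": {
                "by_gender": {
                    "terms": {"field": "gender"},
                    "aggs": {"average_age": {"avg": {"field": "age"}}},
                }
            },
        }
    },
}

print(json.dumps(search_body, indent=2))
```

A body like this would typically be posted to the _search endpoint of an index; each level of aggs nests one bucket inside another, which is how the grouping by city and then by gender is expressed.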

The limitations of Elasticsearch are as follows:

  • Not real time - eventual consistency (near real time):
    • Newly indexed data becomes searchable only after a short delay, which is one second by default. A process known as refresh runs every second by default and makes the data searchable.
  • Doesn't support SQL-like joins but provides parent-child and nested documents to handle relations.
  • Doesn't support transactions and rollbacks: Transactions in a distributed system are expensive. It offers version-based control to make sure the update is happening on the latest version of the document.
  • Updates are expensive. An update on the existing document deletes the document and re-inserts it as a new document.
  • Elasticsearch might lose data due to the following reasons:

We will discuss all these concepts in detail in later chapters.
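The version-based control mentioned in the limitations can be illustrated with a small in-memory stand-in. This is a sketch of the idea only: the class and method names are hypothetical and do not correspond to any Elasticsearch client API.

```python
class VersionConflict(Exception):
    """Raised when an update targets a stale version of a document."""


class VersionedStore:
    """In-memory stand-in for an index; it only illustrates the version check."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def index(self, doc_id, source):
        version = self._docs.get(doc_id, (0, None))[0] + 1
        self._docs[doc_id] = (version, source)
        return version

    def get(self, doc_id):
        return self._docs[doc_id]

    def update(self, doc_id, source, expected_version):
        current_version, _ = self._docs[doc_id]
        if current_version != expected_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {current_version}")
        # An update replaces the whole document and bumps its version.
        self._docs[doc_id] = (current_version + 1, source)
        return current_version + 1


store = VersionedStore()
store.index("1", {"title": "Learning Elasticsearch"})
version, _ = store.get("1")
store.update("1", {"title": "Learning Elasticsearch, 2nd draft"}, version)
try:
    store.update("1", {"title": "stale write"}, version)  # version is now outdated
except VersionConflict as error:
    print("rejected:", error)
```

The second update is rejected because it was based on a version that had already been replaced, which is the guarantee version-based control offers in place of transactions.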

 

Summary

In this chapter, you learned the basic concepts of Elasticsearch. Elasticsearch REST APIs make most operations simple and straightforward. We discussed how to index, update, and delete documents. You also learned how distributed search works and how failures are handled automatically. At the end of the chapter, we discussed various strengths and limitations of Elasticsearch.

In the next chapter, we will discuss how to set up Elasticsearch and Kibana. Installing Elasticsearch is very easy as it's designed to run out of the box. Throughout this book, several examples are used to better explain various concepts. Once you have Elasticsearch up and running, you can try the queries for yourself.

About the Author
  • Abhishek Andhavarapu

    Abhishek Andhavarapu is a software engineer at eBay who enjoys working on highly scalable distributed systems. He has a master's degree in Distributed Computing and has worked on multiple enterprise Elasticsearch applications, which are currently serving hundreds of millions of requests per day. He began his journey with Elasticsearch in 2012 to build an analytics engine to power dashboards and quickly realized that Elasticsearch is like nothing out there for search and analytics. He has been a strong advocate since then and wrote this book to share the practical knowledge he gained along the way.
