
Elasticsearch Indexing: How to Improve User's Search Experience

By Huseyin Akdogan
Book Dec 2015 176 pages 1st Edition

What do you get with eBook?

  • Instant access to your Digital eBook purchase
  • Download this book in EPUB and PDF formats
  • Access this title in our online reader with advanced features
  • DRM FREE - Read whenever, wherever and however you want

Product Details


Publication date : Dec 30, 2015
Length : 176 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781783987023
Vendor : Elastic

Elasticsearch Indexing

Chapter 1. Introduction to Efficient Indexing

Elasticsearch is an open source, scalable, full-text search engine and data analysis tool, written in Java and based on Apache Lucene. A huge amount of data is produced at every moment in today's world of information technology: in social media, on video sharing sites, and in medium and large-sized companies that provide services in communication, health, security, and other areas. We are talking about an ocean of information/data, and in the world of information technology this ocean is briefly called big data. An important part of this big data is unstructured, scattered, and insignificant when it is in isolation.

For this reason, requirements such as recording, accessing, analyzing, and processing data become significant. Like similar search engines, Elasticsearch is one of the tools developed to deal with these problems of the big data world.

What should I look for, in terms of high efficiency and/or performance, when Elasticsearch is used for the purposes mentioned earlier?

This book targets experienced developers who have used Elasticsearch before and want to extend their knowledge of how to perform Elasticsearch indexing effectively. It therefore assumes that you already know the basic issues and concepts of Elasticsearch: what Elasticsearch is, how to install it, what purposes it serves, and so on. The book in your hands is intended to assist you with technical information and concrete applications for efficient indexing and relevant search results in Elasticsearch. This chapter introduces and discusses the main topics that serve these purposes. To this end, we will look closely at how Elasticsearch stores data and try to understand its document storage strategy. Relevant search results are closely related to data analysis, so we will also give an introduction to the analysis process. In the other chapters of this book, you will find the discussions and examples needed for a better understanding of the following main issues:

  • How to store documents

  • The difference between storable and searchable fields

  • What the function of the analyzer is

  • How to improve relevant search results

Getting started


How does Elasticsearch store data, and how does Elasticsearch access that data? These should be the first questions that come to mind when it comes to efficient indexing. To index efficiently and to improve the querying experience, the first thing to understand is how documents are stored and accessed by Elasticsearch.

The purpose of this chapter is to prepare your mind for the topics that will be discussed throughout the book in more detail.

Understanding the document storage strategy


First of all, we need to answer the question: what is an Elasticsearch index?

The short answer is that an index is like a database in a relational database. Elasticsearch is a document-oriented search and analytics engine. Each record in Elasticsearch is a structured JSON document. In other words, each piece of data that is sent to Elasticsearch for indexing is a JSON document. All fields of the documents are indexed by default, and these indexed fields can be used in a single query. More information about this can be found in the next chapter.

Elasticsearch uses the Apache Lucene library for writing and reading the data from the index. In fact, Apache Lucene is at the heart of Elasticsearch.

Note

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. If you want more information, please refer to https://lucene.apache.org/core/.

Every document sent to Elasticsearch is stored in Apache Lucene, and the library stores all data in a data structure called an inverted index. An inverted index is a data structure that maps terms to documents: it holds a list of all the unique words that appear in any document and, for each word, a list of the documents in which that word appears. This data structure makes fast full-text searching possible at low cost. The inverted index is the basic indexing algorithm used by search engines.
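The idea can be illustrated with a few lines of code. The following is a conceptual sketch, not Elasticsearch or Lucene code: it builds a tiny inverted index by mapping each unique term to the set of document IDs that contain it.

```python
# Conceptual sketch of an inverted index (not Elasticsearch/Lucene code).
from collections import defaultdict

docs = {
    1: "an index is like a database",
    2: "elasticsearch stores json documents in an index",
}

inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted[term].add(doc_id)  # term -> documents containing it

# Looking up a term is now a cheap dictionary access instead of
# scanning every document's text.
print(sorted(inverted["index"]))     # appears in both documents
print(sorted(inverted["database"]))  # appears only in document 1
```

Answering "which documents contain this word?" becomes a single lookup, which is exactly why full-text search over large collections stays fast.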

Note

The inverted index will be discussed in depth in the next chapter.

The _source field

As mentioned earlier, all fields of the documents are indexed by default in Elasticsearch, and these fields can be used in a single query. We usually send data to Elasticsearch because we want to either search or retrieve them.

The _source field is a metadata field automatically generated during indexing within Lucene that stores the actual JSON document. When executing search requests, the _source field is returned by default as shown in the following code snippet:

curl -XPUT localhost:9200/my_index/article/1 -d '{
  "title": "What is an Elasticsearch Index",
  "category": "Elasticsearch",
  "content": "An index is like a...",
  "date": "2015-07-18",
  "tags": ["bigdata", "elasticsearch"]
}'
{"_index":"my_index","_type":"article","_id":"1","_version":1,"created":true}

curl -XGET localhost:9200/my_index/_search?pretty
{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1,
      "hits": [
         {
            "_index": "my_index",
            "_type": "article",
            "_id": "1",
            "_score": 1,
            "_source": {
               "title": "What is an Elasticsearch Index",
               "category": "Elasticsearch",
               "content": "An index is like a...",
               "date": "2015-07-18",
               "tags": [
"bigdata",
"elasticsearch"
               ]
            }
         }
      ]
   }
}

Note

More information about the metadata fields can be found in Chapter 3, Basic Concepts of Mapping.

We sent a document to Elasticsearch that contains title, category, content, date, and tags fields for indexing. Then we ran the search command. The result of the search command is shown in the preceding snippet.

Elasticsearch stores every field of the documents you send to it within the _source field by default, so that it is always able to return everything you sent as part of a search result.

You can change this behavior if you want. This can be a preferred option because, in some cases, you may not need all fields to be returned in the search results. Also, a field does not need to be stored in the _source field in order to be searchable:

curl -XPUT localhost:9200/my_index/_mapping/article -d '{
  "article": {
    "_source": {
      "excludes": [
"date"
      ]
    }
  }
}'
{"acknowledged":true}

curl -XPUT localhost:9200/my_index/article/1 -d '{
  "title": "What is an Elasticsearch Index",
  "category": "Elasticsearch",
  "content": "An index is like a...",
  "date": "2015-07-18",
  "tags": ["bigdata", "elasticsearch"]
}'
{"_index":"my_index","_type":"article","_id":"1","_version":2,"created":false}

What did we do?

Firstly, we removed the date field from the _source field by updating the mapping. Then we sent the same document to Elasticsearch again for reindexing. In the next step, we will list the records whose date is greater than or equal to July 18, 2015 using a range query. The pretty parameter used in the following query tells Elasticsearch to return pretty-printed JSON results:

curl -XGET localhost:9200/my_index/_search?pretty -d '{
  "query": {
    "range": {
      "date": {
        "gte": "2015-07-18"
      }
    }
  }
}'
{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1,
      "hits": [
         {
            "_index": "my_index",
            "_type": "article",
            "_id": "1",
            "_score": 1,
            "_source": {
               "title": "What is an Elasticsearch Index",
               "category": "Elasticsearch",
               "content": "An index is like a...",
               "tags": [
"bigdata",
"elasticsearch"
               ]
            }
         }
      ]
   }
}

As you can see, we can search on the date field even though it is not returned. This is because, as previously mentioned, all fields of a document are indexed by default by Elasticsearch.

The difference between storable and searchable fields

Elasticsearch allows you to manage separately whether a field is searchable and/or storable. This is useful because in some cases we may want to index a field but not store it, or vice versa. In some cases, we might not want to do either.

For a better understanding of the subject, let's change the preceding example. Let's create my_index again with an explicit mapping and disable the _source field:

curl -XDELETE localhost:9200/my_index
{"acknowledged": true}

curl -XPUT localhost:9200/my_index -d '{
  "mappings": {
    "article": {
      "_source": {
        "enabled": false
        },
      "properties": {
        "title": {"type": "string", "store": true},
        "category": {"type": "string"},
        "content": {"type": "string"},
        "date": {"type": "date", "index": "no"},
        "tags": {"type": "string", "index": "no", "store": true}
      }
    }
  }
}'

Firstly, we disabled the _source field for the article type. In this case, unless otherwise stated, no field of the article type is stored or returned. However, we would like to store some fields; here, we want to store only the title and tags fields, using the store option. Enabling the store option lets Elasticsearch store the specified fields, so we explicitly specify which fields we want to store for future scenarios.

In addition, we don't want some fields to be indexed, which means those fields will not be searchable. With the preceding configuration, the date and tags fields will not be searchable but, if requested, the tags field can still be returned.

Note

Keep in mind that after disabling the _source field, you cannot make use of a number of features that come with the _source field, for example, the update API and highlighting.

Now, let's see the effect of the preceding configuration in practice:

curl -XPUT localhost:9200/my_index/article/1 -d '{
  "title": "What is an Elasticsearch Index",
  "category": "Elasticsearch",
  "content": "An index is like a...",
  "date": "2015-07-18",
  "tags": ["bigdata", "elasticsearch"]
}'
{"_index":"my_index","_type":"article","_id":"1","_version":1,"created":true}

curl -XGET localhost:9200/my_index/_search?pretty
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "my_index",
      "_type" : "article",
      "_id" : "1",
      "_score" : 1.0
    } ]
  }
}

curl -XGET localhost:9200/my_index/_search?pretty -d '{
  "query": {
    "range": {
      "date": {
        "gte": "2015-07-18"
      }
    }
  }
}'
{
   "took": 6,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 0,
      "max_score": null,
      "hits": []
   }
}

Firstly, we sent a document for indexing whose date field value is 2015-07-18. Then we ran a match_all query (the search request does not have a body), and we did not see the _source field within the hits.

Then we ran a range query on the date field because we want the documents whose date is greater than or equal to July 18, 2015. Elasticsearch did not return any documents to us because the date field no longer has its default configuration. In other words, the date field was not indexed and is therefore not searchable, so no documents were retrieved.

Now let's run another scenario with following command:

curl -XGET localhost:9200/my_index/_search?pretty -d '{
  "fields": ["title", "content", "tags"],
  "query": {
    "match": {
      "content": "like"
    }
  }
}'
{
   "took": 6,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.13424811,
      "hits": [
         {
            "_index": "my_index",
            "_type": "article",
            "_id": "1",
            "_score": 0.13424811,
            "fields": {
               "title": [
"What is an Elasticsearch Index"
               ],
               "tags": [
"bigdata",
"elasticsearch"
               ]
            }
         }
      ]
   }
}

The document is returned to us as a result of the preceding query because the content field is searchable; but the content field itself is not returned because it was not stored in Lucene.

Understanding the difference between storable and searchable (indexed) fields is important for indexing performance and relevant search results. It offers significant advantages for high-level users.

Analysis


We mentioned earlier that Apache Lucene stores all data in an inverted index. This means that the data sent to Elasticsearch must be transformed into terms of the inverted index for search requests to be answered successfully. The process of transforming data in this way is called analysis.

Elasticsearch has an index analysis module, which maps to the Lucene Analyzer. In general, analyzers are composed of a single Tokenizer and zero or more TokenFilters.

Note

Analysis modules and analyzers will be discussed in depth in Chapter 4, Analysis and Analyzers.

Elasticsearch provides many character filters, tokenizers, and token filters. For example, a character filter may be used to strip out HTML markup, and a token filter may be used to modify tokens (for example, lowercasing them). You can combine them to create custom analyzers, or you can use one of the built-in analyzers.
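The pipeline shape of an analyzer can be sketched in a few lines. The following is a conceptual simulation, not the Elasticsearch API: the function names are illustrative stand-ins for a character filter, a tokenizer, and a token filter chained in the order Elasticsearch applies them.

```python
# Conceptual sketch of an analyzer pipeline (illustrative names,
# not Elasticsearch API calls).
import re

def html_strip_char_filter(text):
    # Character filter: strip HTML markup before tokenization.
    return re.sub(r"<[^>]+>", " ", text)

def whitespace_tokenizer(text):
    # Tokenizer: split the filtered text into tokens.
    return text.split()

def lowercase_token_filter(tokens):
    # Token filter: normalize each token to lowercase.
    return [token.lower() for token in tokens]

def analyze(text):
    # An analyzer chains: character filters -> tokenizer -> token filters.
    return lowercase_token_filter(
        whitespace_tokenizer(html_strip_char_filter(text)))

print(analyze("<em>Joe</em> Jeffers"))  # ['joe', 'jeffers']
```

The same input text produces the same terms at indexing and query time, which is why the analyzer configuration directly shapes what a query can match.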

A good understanding of the analysis process is very important for improving the user's search experience and the relevance of search results, because Elasticsearch (actually, Lucene) uses the analyzer both at indexing time and at query time.

Tip

It is crucial to remember that not all Elasticsearch queries are analyzed.

Now let's examine the importance of the analyzer in terms of relevant search results with a simple scenario:

curl -XPOST localhost:9200/company/employee -d '{
  "firstname": "Joe Jeffers",
  "lastname": "Hoffman",
  "age": 30
}'
{"_index":"company","_type":"employee","_id":"AU7GIEQeR7spPlxvqlud","_version":1,"created":true}

We indexed an employee. His name is Joe Jeffers Hoffman, 30 years old. Let's search the employees that are named Joe in the company index now:

curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match": {
      "firstname": "joe"
    }
  }
}'
{
   "took": 68,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.19178301,
      "hits": [
         {
            "_index": "company",
            "_type": "employee",
            "_id": "AU7GIEQeR7spPlxvqlud",
            "_score": 0.19178301,
            "_source": {
               "firstname": "Joe Jeffers",
               "lastname": "Hoffman",
               "age": 30
            }
         }
      ]
   }
}

All string type fields in the company index will be analyzed by the standard analyzer because the employee type was created with dynamic mapping.

The standard analyzer is the default analyzer that Elasticsearch uses. It removes most punctuation and splits the text on word boundaries, as defined by the Unicode Consortium.

Note

If you want to have more information about the Unicode Consortium, please refer to http://www.unicode.org/reports/tr29/.

In this case, Joe Jeffers would be two tokens (Joe and Jeffers). To see how the standard analyzer works, run the following command:

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'Joe Jeffers'
{
  "tokens" : [ {
    "token" : "joe",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "jeffers",
    "start_offset" : 4,
    "end_offset" : 11,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}

We searched for joe, and the document containing Joe Jeffers was consequently returned to us because the standard analyzer had split the text on word boundaries and converted it to lowercase. The standard analyzer is built using the Lower Case Token Filter along with other filters (the Standard Token Filter and the Stop Token Filter, for example).

Now let's examine the following example:

curl -XDELETE localhost:9200/company
{"acknowledged":true}

curl -XPUT localhost:9200/company -d '{
  "mappings": {
    "employee": {
      "properties": {
        "firstname": {"type": "string", "index": "not_analyzed"}
      }
    }
  }
}'
{"acknowledged":true}

curl -XPOST localhost:9200/company/employee -d '{
  "firstname": "Joe Jeffers",
  "lastname": "Hoffman",
  "age": 30
}'
{"_index":"company","_type":"employee","_id":"AU7GOF2wR7spPlxvqmHY","_version":1,"created":true}

We deleted the company index created by dynamic mapping and recreated it with an explicit mapping. This time, we used the not_analyzed value for the index option on the firstname field in the employee type. This means that the field is not analyzed at indexing time:

curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match": {
      "firstname": "joe"
    }
  }
}'
{
   "took": 12,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 2,
      "failed": 0
   },
   "hits": {
      "total": 0,
      "max_score": null,
      "hits": []
   }
}

As you can see, Elasticsearch did not return a result for the match query because the firstname field is configured as not_analyzed. Therefore, Elasticsearch did not use an analyzer during indexing; the indexed value was exactly as specified. In other words, Joe Jeffers was a single token. Unless otherwise indicated, the match query uses the default search analyzer, so if we want the document to be returned by the match query without changing the analyzer in this example, we need to specify the exact value (paying attention to uppercase/lowercase):

curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match" : {
        "firstname": "Joe Jeffers"
    }
  }
}'

The preceding command will return us the document we searched for. Now let's examine the following example:

curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match_phrase_prefix": {
      "firstname": "Joe"
    }
  }
}'
{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.30685282,
      "hits": [
         {
            "_index": "company",
            "_type": "employee",
            "_id": "AU7GOF2wR7spPlxvqmHY",
            "_score": 0.30685282,
            "_source": {
               "firstname": "Joe Jeffers",
               "lastname": "Hoffman",
               "age": 30
            }
         }
      ]
   }
}

As you can see, the document we searched for was returned to us although we did not specify the exact value (please note that we still used uppercase letters). This is because the match_phrase_prefix query analyzes the text and creates a phrase query out of the analyzed text, allowing a prefix match on the last term in the text.
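Why the prefix query matches here while the earlier match query did not can be sketched in a couple of lines. This is a conceptual simulation, not Elasticsearch internals: with a not_analyzed field the indexed value is a single token, and the last (here, only) query term is allowed to match as a prefix of it.

```python
# Conceptual sketch: prefix matching against a single not_analyzed token
# (a simplification of match_phrase_prefix, not Elasticsearch internals).
def phrase_prefix_match(indexed_token, last_query_term):
    # The last term of the query may match as a prefix of the indexed token.
    return indexed_token.startswith(last_query_term)

print(phrase_prefix_match("Joe Jeffers", "Joe"))  # True: prefix matches
print(phrase_prefix_match("Joe Jeffers", "joe"))  # False: case still matters
```

Note that because the field was never lowercased at indexing time, the prefix comparison remains case sensitive.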

Summary


In this chapter, we looked at the main topics for efficient indexing and relevant search results: How are documents stored? What is the difference between storable and searchable fields? What is the analysis process, and what is its impact on the relevance of search results? In addition, we briefly discussed some basic concepts of Elasticsearch that are associated with Lucene (for example, the inverted index and the _source field).

In the next chapter, you'll learn about the Elasticsearch index: what mapping is, what an inverted index is, the denormalized data structure, and some other concepts related to this topic.


Key benefits

  • Improve user’s search experience with the correct configuration
  • Deliver relevant search results – fast!
  • Save time and system resources by creating stable clusters

Description

Beginning with an overview of the way ElasticSearch stores data, you’ll begin to extend your knowledge to tackle indexing and mapping, and learn how to configure ElasticSearch to meet your users’ needs. You’ll then find out how to use analysis and analyzers for greater intelligence in how you organize and pull up search results – to guarantee that every search query is met with the relevant results! You’ll explore the anatomy of an ElasticSearch cluster, and learn how to set up configurations that give you optimum availability as well as scalability. Once you’ve learned how these elements work, you’ll find real-world solutions to help you improve indexing performance, as well as tips and guidance on safety so you can back up and restore data. Once you’ve learned each component outlined throughout, you will be confident that you can help to deliver an improved search experience – exactly what modern users demand and expect.

What you will learn

  • Learn how ElasticSearch efficiently stores data – and find out how it can reduce costs
  • Control document metadata with the correct mapping strategies and by configuring indices
  • Use ElasticSearch analysis and analyzers to incorporate greater intelligence and organization across your documents and data
  • Find out how an ElasticSearch cluster works – and learn the best way to configure it
  • Perform high-speed indexing with low system resource cost
  • Improve query relevance with appropriate mapping, suggest API, and other ElasticSearch functionalities


Table of Contents

  1. Elasticsearch Indexing
  2. Credits
  3. About the Author
  4. About the Reviewer
  5. www.PacktPub.com
  6. Preface
  7. Introduction to Efficient Indexing
  8. What is an Elasticsearch Index
  9. Basic Concepts of Mapping
  10. Analysis and Analyzers
  11. Anatomy of an Elasticsearch Cluster
  12. Improving Indexing Performance
  13. Snapshot and Restore
  14. Improving the User Search Experience
  15. Index

