
You're reading from Elasticsearch Server: Second Edition

Product type Book

Published in Apr 2014

Publisher

ISBN-13 9781783980529

Pages 428 pages

Edition 1st Edition

Languages

Java

Concepts

Enterprise Search

Table of Contents (18) Chapters

Elasticsearch Server Second Edition

Credits

About the Author

Acknowledgments

About the Author

Acknowledgments

About the Reviewers

www.PacktPub.com

Preface

Getting Started with the Elasticsearch Cluster

Indexing Your Data

Searching Your Data

Extending Your Index Structure

Make Your Search Better

Beyond Full-text Searching

Elasticsearch Cluster in Detail

Administrating Your Cluster

Index

Chapter 5. Make Your Search Better

In the previous chapter, we learned how Elasticsearch indexing works when it comes to data that is not flat. We saw how to index tree-like structures. In addition to that, we indexed data that had an object-oriented structure. We also learned how to modify the structure of already created indices. Finally, we saw how to handle relationships in Elasticsearch by using nested documents as well as the parent-child functionality. By the end of this chapter, you will have learned the following topics:

Apache Lucene scoring
Using the scripting capabilities of Elasticsearch
Indexing and searching data in different languages
Using different queries to influence the score of the returned documents
Using index-time boosting
Words having the same meaning
Checking why a particular document was returned
Checking score calculation details

An introduction to Apache Lucene scoring

When talking about queries and their relevance, we can't omit information about scoring and where it comes from. But what is the score? The score is a number that describes the relevance of a document to a query. In the following section, we will discuss the default Apache Lucene scoring mechanism, the TF/IDF algorithm, and how it affects the returned documents.
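To make the idea of TF/IDF more concrete, here is a deliberately simplified sketch in Python. It assumes the classic Lucene-style factors (square root of the term frequency, a logarithmic inverse document frequency, and a length norm) and ignores everything else that real Lucene scoring includes, such as the coordination factor, query normalization, boosts, and the lossy encoding of norms. The function names and the tiny corpus are ours and exist only for illustration.

```python
import math

def tf(term_freq):
    # Term frequency factor: grows with the number of occurrences, but sublinearly.
    return math.sqrt(term_freq)

def idf(num_docs, doc_freq):
    # Inverse document frequency: terms found in fewer documents weigh more.
    return 1.0 + math.log(num_docs / (doc_freq + 1.0))

def field_norm(num_terms):
    # Length norm: a match in a shorter field counts for more.
    return 1.0 / math.sqrt(num_terms)

def toy_score(query_terms, doc_tokens, corpus):
    """Simplified TF/IDF score of one tokenized document (not real Lucene scoring)."""
    if not doc_tokens:
        return 0.0
    score = 0.0
    for term in query_terms:
        freq = doc_tokens.count(term)
        if freq == 0:
            continue
        doc_freq = sum(1 for other in corpus if term in other)
        score += tf(freq) * idf(len(corpus), doc_freq) ** 2 * field_norm(len(doc_tokens))
    return score

# A made-up, pre-tokenized corpus used only for this illustration.
corpus = [
    "crime and punishment".split(),
    "the complete sherlock holmes".split(),
    "crime crime and more crime".split(),
]
scores = [toy_score(["crime"], doc, corpus) for doc in corpus]
for doc, score in zip(corpus, scores):
    print(round(score, 3), " ".join(doc))
```

In this toy corpus, the document that repeats the term scores highest, and the document without the term scores zero. The exact numbers carry no meaning outside this sketch.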

Note

The TF/IDF algorithm is not the only available algorithm exposed by Elasticsearch. For more information about available models, refer to the Different similarity models section in Chapter 2, Indexing Your Data, and our book, Mastering ElasticSearch, Packt Publishing.

When a document is matched

When a document is returned by Lucene, it means that Lucene matched the query we sent and that document has been given a score. The higher the score, the more relevant the document is from the search engine point of view. However, the score calculated for the same document on two different queries...

Scripting capabilities of Elasticsearch

Elasticsearch has a few functionalities where scripts can be used. You've already seen examples such as updating documents, filtering, and searching. Regardless of the fact that this seems to be advanced, we will take a look at the possibilities offered by Elasticsearch, because scripts are priceless for some use cases.

If we look at any request made to Elasticsearch that uses scripts, we will notice some similar properties, which are as follows:

script: This property contains the actual script code.
lang: This property names the language the script is written in. If it is omitted, Elasticsearch assumes mvel.
params: This object contains parameters and their values. Every defined parameter can be used inside the script by referring to it by name. Parameters keep scripts cleaner, and a parameterized script usually runs faster than one with embedded constants, because the compiled script can be cached and reused when only the parameter values change.
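The three properties can be seen together in a small sketch that builds a search request body with a script field. This is only an illustration: the field name year, the parameter name offset, and the script field name adjusted_year are hypothetical, and the request is built as a Python dictionary rather than sent to a cluster.

```python
import json

# The script text stays constant; only the parameter value changes per request.
# `year` and `offset` are hypothetical names chosen for this sketch.
SCRIPT = "doc['year'].value - offset"

def script_field_request(offset):
    """Build a search body that computes one script field for every hit."""
    return {
        "query": {"match_all": {}},
        "script_fields": {
            "adjusted_year": {
                "script": SCRIPT,
                "lang": "mvel",
                "params": {"offset": offset},
            }
        },
    }

first = script_field_request(1800)
second = script_field_request(1900)
print(json.dumps(first, indent=2))
```

Because both requests carry the same script text and differ only in params, the compiled script can be reused. Embedding the constant in the script itself would produce a different script for every value.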

Objects available during script execution...

Searching content in different languages

Until now, our discussion of language analysis has been mostly theoretical. We haven't seen an example of language analysis or of handling data that contains multiple languages. That changes now, as we discuss how to handle data in multiple languages.

Handling languages differently

As you already know, Elasticsearch allows us to choose different analyzers for our data. We can have our data split on whitespace, lowercased, and so on. This can usually be done regardless of the language: tokenization on whitespace works the same way for English, German, and Polish (that doesn't apply to Chinese, though). However, what if you want to find documents that contain words such as cat and cats by only sending the word cat to Elasticsearch? This is where language analysis comes into play with stemming algorithms for different languages, which allow the analyzed words to...
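The cat and cats case can be illustrated with a deliberately naive sketch. The suffix stripping below is our own toy rule, not the stemming algorithm used by any Elasticsearch language analyzer; it only shows why reducing words to a common form at both index time and query time lets the two words match.

```python
def naive_stem(token):
    # Toy rule for illustration only: lowercase, then strip a plural "s".
    token = token.lower()
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def analyze(text):
    # Whitespace tokenization followed by the toy stemmer.
    return [naive_stem(token) for token in text.split()]

def matches(query, document):
    # A document matches when it shares at least one analyzed term with the query.
    return bool(set(analyze(query)) & set(analyze(document)))

print(analyze("Two cats sleeping"))
print(matches("cat", "Two cats sleeping"))
```

Real stemmers handle far more cases (irregular plurals, verb forms, language-specific suffixes), which is why a language-specific analyzer is chosen per field.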

Influencing scores with query boosts

Earlier in this chapter, we learned what scoring is and how Elasticsearch calculates it. As an application grows, so does the need to improve the quality of search, which we call the search experience. We need to learn what matters most to users and observe how they use the search functionality. This leads to various conclusions; for example, we see that some parts of the documents are more important than the others or that particular queries emphasize one field at the cost of others. This is where boosting can be used.

The boost

Boost is an additional value used in the process of scoring. We already know it can be applied to the following:

query: This is a way to inform the search engine that the given query, which is a part of a complex query, is more significant than the other parts.
field: Several document fields are important for the user. For example, searching e-mails by Bill should probably list those from Bill first, followed...
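As a rough sketch of query-time boosting, the following Python function builds a bool query in which a match on one field counts for more than a match on another. The field names from and to, and the boost value, are assumptions made for this illustration and are not taken from the example data.

```python
import json

def email_query(name, from_boost=5.0):
    """Build a bool query that favors matches in the sender field.

    `from` and `to` are hypothetical field names. Term queries are not
    analyzed, so the name is expected in its indexed (lowercased) form.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    {"term": {"from": {"value": name, "boost": from_boost}}},
                    {"term": {"to": {"value": name}}},
                ]
            }
        }
    }

body = email_query("bill")
print(json.dumps(body, indent=2))
```

Both clauses can match the same document; the boost only changes how much each match contributes to the final score, so it affects ordering rather than which documents are returned.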

When does index-time boosting make sense?

In the previous section, we discussed boosting queries. This type of boosting is very handy and powerful and fulfills its role in most situations. However, there is one case in which index-time boosting is more convenient: when we already know at indexing time which documents are important. We gain a boost that is independent of the query at the cost of reindexing (we need to reindex the document when the boost value is changed). In addition to that, the performance is slightly better because some parts needed in the boosting process are already calculated at index time. Elasticsearch stores information about the boost as a part of normalization information. This is important because if we set omit_norms to true, we can't use index-time boosting.
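The dependency between index-time boosting and norms can be checked mechanically. The sketch below assumes a mapping in which a field-level boost property is set, and flags any boosted field that also sets omit_norms to true. The type name book, the field names, and the helper function are hypothetical and only illustrate the rule described above.

```python
def boosted_fields_without_norms(mapping):
    """Return the boosted fields that also omit norms, so their boost would be lost."""
    problems = []
    for type_name, type_def in mapping.get("mappings", {}).items():
        for field, props in type_def.get("properties", {}).items():
            if "boost" in props and props.get("omit_norms", False):
                problems.append(type_name + "." + field)
    return sorted(problems)

# Hypothetical mapping: `author` is boosted and keeps norms, `title` is not boosted.
good_mapping = {
    "mappings": {
        "book": {
            "properties": {
                "title": {"type": "string"},
                "author": {"type": "string", "boost": 10.0},
            }
        }
    }
}

# The same idea, but the boosted field also omits norms.
bad_mapping = {
    "mappings": {
        "book": {
            "properties": {
                "author": {"type": "string", "boost": 10.0, "omit_norms": True},
            }
        }
    }
}

print(boosted_fields_without_norms(good_mapping))
print(boosted_fields_without_norms(bad_mapping))
```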

Defining field boosting in input data

Let's look at the typical document definition, which looks as follows:

{
  "title" : "The Complete Sherlock Holmes",
  "author" : "Arthur...

Words with the same meaning

You may have heard about synonyms: words that have the same or similar meaning. Sometimes, you will want a document to match when any one of a group of such words is entered into the search box. Let's recall our sample data from The example data section of Chapter 3, Searching Your Data; there was a book called Crime and Punishment. What if we want that book to be matched not only when the words crime or punishment are used, but also when words like criminality and abuse are used? To do this, we will use synonyms.

The synonym filter

In order to use the synonym filter, we need to define our own analyzer. Our analyzer will be called synonym and will use the whitespace tokenizer and a single filter called synonym. Our filter's type property needs to be set to synonym, which tells Elasticsearch that this filter is a synonym filter. In addition to that, we want to ignore case so that upper- and lowercase synonyms will be treated equally (set the ignore_case property to true...
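The analyzer described above can be sketched as index settings built in Python. The analyzer name, tokenizer, filter name, filter type, and ignore_case value follow the description; the synonym rules themselves are our assumption, written in the comma-separated form in which all listed words are treated as equivalent.

```python
import json

def synonym_settings(rules):
    """Build index settings with a `synonym` analyzer backed by a `synonym` filter."""
    return {
        "settings": {
            "index": {
                "analysis": {
                    "analyzer": {
                        "synonym": {
                            "tokenizer": "whitespace",
                            "filter": ["synonym"],
                        }
                    },
                    "filter": {
                        "synonym": {
                            "type": "synonym",
                            "ignore_case": True,
                            "synonyms": list(rules),
                        }
                    },
                }
            }
        }
    }

# Hypothetical rules for the Crime and Punishment example.
settings = synonym_settings(["crime, criminality", "punishment, abuse"])
print(json.dumps(settings, indent=2))
```

A field then has to reference the synonym analyzer in its mapping for the rules to take effect.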

Understanding the explain information

Compared to databases, systems capable of full-text search can often behave in ways that are anything but obvious. We can search many fields simultaneously, and the data in the index can differ from the values provided in the document fields (because of the analysis process, synonyms, abbreviations, and so on). To make matters worse, search engines sort data by relevance by default, that is, by a number that indicates how similar the document is to the query. The key here is how similar. As we already discussed, scoring takes many factors into account: how many searched words were found in the document, how frequent each word was, how many terms were present in the field, and so on. This seems complicated, and finding out why a document was found and why one document ranks above another is not easy. Fortunately, Elasticsearch has some tools that can answer these questions, and we will look at them now.
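As a small sketch of how such information can be consumed, the code below builds a search body with the explain flag set and prints an explanation tree made of value, description, and details entries. The sample tree is hand-written for illustration and is not real Elasticsearch output; the helper names are ours.

```python
def search_with_explain(query):
    # Setting "explain" to true asks for a score explanation with every hit.
    return {"explain": True, "query": query}

def format_explanation(node, depth=0):
    """Flatten a nested explanation (value, description, details) into indented lines."""
    lines = ["  " * depth + "%.3f  %s" % (node["value"], node["description"])]
    for child in node.get("details", []):
        lines.extend(format_explanation(child, depth + 1))
    return lines

# Hand-written sample with made-up numbers, NOT output from a real cluster.
sample = {
    "value": 0.5,
    "description": "hand-written root, product of:",
    "details": [
        {"value": 1.0, "description": "hand-written factor A"},
        {"value": 0.5, "description": "hand-written factor B"},
    ],
}

body = search_with_explain({"term": {"title": "crime"}})
lines = format_explanation(sample)
print("\n".join(lines))
```

Reading the tree from the leaves up shows which factors were combined into the final score of a hit.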

Understanding field analysis

One of the common questions...

Summary

In this chapter, we learned how Apache Lucene scoring works internally. We've also seen how to use the scripting capabilities of Elasticsearch and how to index and search documents in different languages. We've used different queries to alter the score of our documents and modify it so it fits our use case. We've learned about index-time boosting, what synonyms are, and how they can help us. Finally, we've seen how to check why a particular document was a part of the result set and how its score was calculated.

In the next chapter, we'll go beyond full-text searching. We'll see what aggregations are and how we can use them to analyze our data. We'll also look at faceting, which likewise lets us aggregate our data and bring meaning to it. We'll use suggesters to implement spellchecking and autocomplete, and we'll use prospective search to find out which documents match particular queries. We'll index binary files and use geospatial capabilities to search our data with the use of geographical...