Chapter 2. The Improved Query DSL
In the previous chapter, we looked at an overview of Lucene: how it works, how analysis is performed, and how to use the Apache Lucene query language. We also discussed the basic concepts of Elasticsearch, introduced Elasticsearch 5.x, and covered the latest features introduced in versions 2.x and 5.x, along with the most important feature changes and removals Elasticsearch went through during the transition from 1.x to 5.x. In this chapter, we will dive deep into Elasticsearch, focusing on the Query DSL. We will first go through the Lucene similarity algorithm formulas before turning to advanced queries. By the end of this chapter, we will have covered the following topics:
The changed default text scoring in Lucene: BM25
Understanding precision and recall
How BM25 differs from TF-IDF
Elasticsearch Query DSL
Understanding bool query syntax
Which query you should use for your particular use case
Important changes in...
The changed default text scoring in Lucene - BM25
Scoring is the most important part of Apache Lucene. It is the process of calculating the score property of a document in the scope of a given query. A score is a factor that describes how well the document matches the query. Lucene supports many scoring algorithms, but since the beginning of Lucene, TF-IDF (term frequency-inverse document frequency) has been the default. One of the major changes in Apache Lucene 6.0 is that the default scoring algorithm is now BM25 (Best Matching). In this section, we will first cover two fundamental concepts of search relevancy, precision and recall, and then look at the new default Apache Lucene scoring mechanism and how it differs from TF-IDF.
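To make the contrast between the two scoring models concrete, here is a minimal Python sketch of the per-term BM25 formula as it is commonly stated, next to a square-root term-frequency component of the kind used in classic TF-IDF. The function names are ours, the values k1 = 1.2 and b = 0.75 are assumed to be the usual defaults, and the sketch ignores query-time boosts and Lucene's internal encoding of field lengths, so treat it as an illustration rather than Lucene's actual implementation.

```python
import math

K1 = 1.2   # assumed default: controls how quickly term frequency saturates
B = 0.75   # assumed default: controls the strength of length normalization


def bm25_idf(doc_count, doc_freq):
    """Inverse document frequency in the commonly stated BM25 form."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_tf(freq, doc_len, avg_doc_len, k1=K1, b=B):
    """Term-frequency component; it approaches k1 + 1 as freq grows."""
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return freq * (k1 + 1) / (freq + norm)


def bm25_term_score(freq, doc_len, avg_doc_len, doc_count, doc_freq):
    """Score contribution of a single query term for a single document."""
    return bm25_idf(doc_count, doc_freq) * bm25_tf(freq, doc_len, avg_doc_len)


def classic_tf(freq):
    """Square-root term frequency, which keeps growing without a bound."""
    return math.sqrt(freq)


if __name__ == "__main__":
    for freq in (1, 10, 100):
        print(freq, round(bm25_tf(freq, 100, 100), 3), round(classic_tf(freq), 3))
```

The loop at the bottom prints both term-frequency components side by side: the BM25 value levels off below k1 + 1, while the square-root value keeps rising as the term is repeated.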
After executing a search query, an obvious question comes to mind: Have I found the most relevant documents or am I missing important documents...
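As a small illustration of the two measures, the following Python sketch computes precision and recall from a set of retrieved document IDs and a set of documents judged relevant. The function name and the sample IDs are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical IDs: four documents returned, five documents actually relevant.
    print(precision_recall([1, 2, 3, 4], [2, 4, 6, 8, 10]))
```

In this example two of the four returned documents are relevant, and two of the five relevant documents were found, so the query has higher precision than recall.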
The release of Elasticsearch 2.0.0 saw a major refactoring of the Elasticsearch Query DSL, the interface provided by Elasticsearch to write queries in the JSON format. Covering the complete Query DSL is outside the scope of this book. However, for someone without much prior experience with a full-text search engine, the number of queries exposed by Elasticsearch can be overwhelming and very confusing. So, we have decided to cover the most frequently used queries and to show readers how to use queries and filters with the revised Query DSL syntax. Thus, this book can serve both types of readers: those who know the older versions and those who are starting with version 5.0.
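As a rough sketch of what the revised syntax looks like, the following Python snippet builds a request body in which a scoring match clause and non-scoring filter clauses sit side by side inside one bool query. The helper name, the field names (title, status, year), and the values are hypothetical and chosen only for illustration.

```python
import json


def bool_query(field, text, filters):
    """Build a search body that mixes a scoring clause with filter clauses.

    Clauses under "must" contribute to the score; clauses under "filter"
    only include or exclude documents and do not affect the score.
    """
    return {
        "query": {
            "bool": {
                "must": [{"match": {field: text}}],
                "filter": filters,
            }
        }
    }


if __name__ == "__main__":
    body = bool_query(
        "title",                                   # hypothetical field
        "crime and punishment",
        [
            {"term": {"status": "published"}},     # hypothetical field and value
            {"range": {"year": {"gte": 1800}}},    # hypothetical field and value
        ],
    )
    # The printed JSON is what would be sent as the body of a search request.
    print(json.dumps(body, indent=2))
```

Keeping the filter clauses separate from the must clause makes it explicit which conditions should influence relevance and which should only narrow the result set.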
Choosing the right query for the job
In this section, we will start with different categories of queries available in Elasticsearch, looking at which query to use in which situation.
Of course, categorizing queries is a hard task and the following list of categories is not the only correct one. If you were to ask other Elasticsearch users, they would provide their own categories, or say that each query can be assigned to more than a single category, and they would be right. We also think that there is no single way of categorizing the queries; however, in our opinion, each Elasticsearch query can be assigned to one (or more) of the following categories:
Basic queries: A category that groups queries allowing searching for a part of the index, either in an analyzed or a non-analyzed manner. The key point in this category is that you cannot nest other queries inside a basic query. An example of a basic query is the term query.
Compound queries: A category grouping queries that allows...
We have already talked about scoring, which is valuable knowledge, especially when trying to improve the relevance of our queries. When debugging your queries, it is also valuable to know how they are executed; that is why we decided to include this section on how query rewrite works in Elasticsearch, why it is used, and how to control it.
If you have ever used queries such as the prefix query and the wildcard query, or basically any query that is said to be multiterm, you've probably heard about query rewriting. Elasticsearch rewrites such queries for performance reasons. The rewrite process changes the original, expensive query into a set of queries that are far less expensive from Lucene's point of view, and thus speeds up query execution. The rewrite process is not visible to the client, but it is good to know that we can alter its behavior. For example, let's look at what Elasticsearch does...
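The idea behind rewriting can be sketched in a few lines of Python. The function below expands a prefix into one term clause per matching indexed term, which is conceptually what a boolean-style rewrite produces; it is a simplified illustration over a made-up term list, not Lucene's actual code. The last part shows a prefix query body that sets the rewrite parameter; the field name is hypothetical, and the available rewrite values should be checked against the documentation for your Elasticsearch version.

```python
def expand_prefix(field, prefix, indexed_terms):
    """Conceptual rewrite: turn a prefix into a bool query of term clauses."""
    matching = sorted(t for t in indexed_terms if t.startswith(prefix))
    return {"bool": {"should": [{"term": {field: t}} for t in matching]}}


# A prefix query that asks for a specific rewrite method.
# "name" is a hypothetical field; "constant_score_boolean" is one of the
# rewrite values we assume is available in the version being used.
prefix_query = {
    "query": {
        "prefix": {
            "name": {
                "value": "j",
                "rewrite": "constant_score_boolean",
            }
        }
    }
}


if __name__ == "__main__":
    terms = ["jack", "jane", "joe", "mike"]   # made-up terms in the index
    print(expand_prefix("name", "j", terms))
    print(prefix_query)
```

With the sample terms, the prefix j expands into three term clauses (jack, jane, and joe), which shows why a prefix that matches many terms can become expensive and why the rewrite method matters.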
As an application grows, its environment is very likely to become more and more complicated. In your organization, you probably have developers who specialize in particular layers of the application. For example, you have at least one frontend designer and an engineer responsible for the database layer. It is very convenient to have the development divided into several modules because you can work on different parts of the application in parallel, without the need for constant synchronization between individuals and the whole team. Of course, the book you are currently reading is not a book about project management, but search, so let's stick to that topic. In general, it would be useful, at least sometimes, to be able to extract all queries generated by the application, give them to a search engineer, and let them optimize the queries in terms of both performance and relevance. In such a case, the application developers would only have to pass the query...
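The separation described here can be sketched as follows: the search engineer owns a query template with named placeholders, and the application supplies only the parameter values. The tiny render function below is a local stand-in for Mustache-style substitution, written just for this example; in Elasticsearch the rendering is done on the server by the search template feature. The template text, parameter names, and field name are all hypothetical.

```python
import json
import re


def render(template, params):
    """Replace each {{name}} placeholder with the matching parameter value.

    This is a naive stand-in for a Mustache engine: it performs no escaping
    and supports no sections or conditionals.
    """
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(params[m.group(1)]), template)


# Owned by the search engineer (hypothetical template).
TEMPLATE = '{"query": {"match": {"{{field}}": "{{value}}"}}, "size": {{size}}}'

# Supplied by the application (hypothetical parameters).
PARAMS = {"field": "title", "value": "crime", "size": 10}


if __name__ == "__main__":
    rendered = json.loads(render(TEMPLATE, PARAMS))
    print(json.dumps(rendered, indent=2))
```

Because the application only passes parameters, the query structure can be tuned for performance or relevance without touching application code.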
In this chapter, we explained scoring and the new default similarity ranking algorithm, BM25, and how it differs from TF-IDF, the previous default ranking algorithm used in Apache Lucene. In addition to that, we discussed precision and recall, the fundamentals of search relevancy.
After that, we discussed Elasticsearch Query DSL in detail and covered the important queries with their use cases. We also saw the new bool query syntax and how one can use filters within the query context of the bool query. The chapter also covered a detailed discussion about using query rewrites and using search templates along with the Mustache template engine.
In the next chapter, you will learn about query rescoring and how search works in multimatch scenarios, such as cross-field matching and phrase matching. We will also cover various ways to use scripting in Elasticsearch.