Mastering Elasticsearch 5.x - Third Edition

Product type Book
Published in Feb 2017
Publisher Packt
ISBN-13 9781786460189
Pages 428 pages
Edition 3rd Edition

Chapter 2. The Improved Query DSL

In the previous chapter, we gave an overview of Lucene: how it works, how analysis is performed, and how to use the Apache Lucene query language. We also discussed the basic concepts of Elasticsearch, introduced Elasticsearch 5.x, and covered the latest features introduced in versions 2.x and 5.x, along with the most important feature changes and removals made during the transition from 1.x to 5.x. In this chapter, we will dive deep into Elasticsearch, focusing on the Query DSL. We will first go through the formulas behind Lucene's similarity algorithms before turning to advanced queries. By the end of this chapter, we will have covered the following topics:

  • The changed default text scoring in Lucene: BM25

  • Understanding precision and recall

  • How BM25 differs from TF-IDF

  • Elasticsearch Query DSL

  • Understanding bool query syntax

  • Which query you should use for your particular use case

  • Important changes in...

The changed default text scoring in Lucene - BM25


Scoring is the most important part of Apache Lucene. It is the process of calculating the score property of a document in the scope of a given query. A score is a factor that describes how well the document matches the query. Lucene supports many algorithms for score calculation, but since the beginning of Lucene, TF-IDF (term frequency-inverse document frequency) has been the default. One of the major changes in Apache Lucene 6.0 is a new default scoring algorithm: BM25 (Best Matching). In this section, we will first cover two fundamental concepts of search relevancy, precision and recall, and then look at the new default Apache Lucene scoring mechanism and how it differs from TF-IDF.
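To make the contrast concrete, here is a minimal sketch of the per-term building blocks of both schemes. It assumes the textbook Okapi BM25 form with the commonly used defaults k1 = 1.2 and b = 0.75, and the square-root term frequency of classic TF-IDF; the function names are ours, and Lucene's real implementation adds details (such as encoded length norms) that are left out here.

```python
import math

def bm25_idf(doc_count, doc_freq):
    # Rarer terms (small doc_freq) get a larger weight.
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_tf(freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    # Term-frequency part of BM25: saturates as freq grows and
    # penalizes documents longer than the average length.
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return freq * (k1 + 1) / (freq + length_norm)

def classic_tf(freq):
    # Term-frequency part of classic TF-IDF: grows without bound.
    return math.sqrt(freq)

# Compare how both term-frequency parts react to repeated terms
# in a document of exactly average length.
for freq in (1, 2, 5, 20, 100):
    print(freq, round(classic_tf(freq), 2), round(bm25_tf(freq, 100, 100), 2))
```

The loop shows the main practical difference: under this form the BM25 term-frequency part can never exceed k1 + 1, so repeating a term many times yields diminishing returns, whereas the square-root form keeps growing.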

Precision versus recall

After executing a search query, an obvious question comes to mind: Have I found the most relevant documents or am I missing important documents...
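Both measures can be computed directly once we know which documents were returned and which are actually relevant. The following is a small sketch using the standard definitions (precision is the share of returned documents that are relevant; recall is the share of relevant documents that were returned). The document IDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query's result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: four documents returned, three are truly relevant.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d7"],
)
print(precision, recall)
```

Here two of the four returned documents are relevant, and two of the three relevant documents were found, so improving one measure (for example, by returning more documents) can easily hurt the other.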

Refactored Query DSL


The release of Elasticsearch 2.0.0 brought a major refactoring of the Elasticsearch Query DSL, the interface Elasticsearch provides for writing queries in JSON format. Covering the complete Query DSL is beyond the scope of this book. However, for someone without much prior experience with a full text search engine, the number of queries exposed by Elasticsearch can be overwhelming and confusing. We have therefore decided to cover the most frequently used queries and to show how to use queries and filters with the revised Query DSL syntax. In this way, the book can serve both types of readers: those who know the older versions and those who are starting with version 5.0.
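One visible effect of the refactoring is that the old filtered query (deprecated in 2.0 and removed in 5.0) is expressed as a bool query with must and filter clauses. The sketch below shows that translation on a request body built as a Python dictionary. The helper name and the field names title and status are hypothetical.

```python
import json

def filtered_to_bool(filtered_body):
    """Translate the body of an old `filtered` query into a `bool` query."""
    # A filtered query without a query part matched all documents.
    bool_body = {"must": filtered_body.get("query", {"match_all": {}})}
    if "filter" in filtered_body:
        # Clauses under `filter` run in filter context: they do not
        # contribute to the score and can be cached.
        bool_body["filter"] = filtered_body["filter"]
    return {"bool": bool_body}

# Pre-2.0 style request body (hypothetical fields).
old_query = {
    "filtered": {
        "query": {"match": {"title": "elasticsearch"}},
        "filter": {"term": {"status": "published"}},
    }
}

new_query = filtered_to_bool(old_query["filtered"])
print(json.dumps({"query": new_query}, indent=2))
```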

Choosing the right query for the job


In this section, we will start with different categories of queries available in Elasticsearch, looking at which query to use in which situation.

Query categorization

Categorizing queries is a hard task, and the following list of categories is not the only correct one. Other Elasticsearch users might propose their own categories, or point out that a query can belong to more than one category, and they would be right. There is no single way of categorizing queries; however, in our opinion, each Elasticsearch query can be assigned to one (or more) of the following categories:

  • Basic queries: A category that groups queries allowing searching for a part of the index, either in an analyzed or a non-analyzed manner. The key point in this category is that you cannot nest queries inside a basic query. An example of a basic query is the term query.

  • Compound queries: A category grouping queries that allows...
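The difference between the first two categories can be sketched with request bodies built as Python dictionaries: a basic query targets a field directly, while a compound query takes other queries as its clauses. The field names and values are hypothetical, and dis_max is used here only as one example of a compound query.

```python
import json

# Basic queries: each one targets a field directly and wraps no other query.
term_query = {"term": {"tags": "search"}}            # not analyzed
match_query = {"match": {"title": "query rewrite"}}  # analyzed

# Compound query: its clauses are themselves queries.
compound_query = {
    "dis_max": {
        "queries": [term_query, match_query],
        "tie_breaker": 0.3,
    }
}

print(json.dumps({"query": compound_query}, indent=2))
```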

Query rewrite explained


We have already talked about scoring, which is valuable knowledge, especially when trying to improve the relevance of our queries. When debugging queries, it is also valuable to know how they are executed. For this reason, we decided to include this section on how query rewrite works in Elasticsearch, why it is used, and how to control it.

If you have ever used the prefix query, the wildcard query, or any other query described as multiterm, you have probably heard about query rewriting. Elasticsearch rewrites such queries for performance reasons. The rewrite process replaces the original, expensive query with a set of queries that are far less expensive from Lucene's point of view, which speeds up query execution. The rewrite process is not visible to the client, but it is good to know that we can alter its behavior. For example, let's look at what Elasticsearch does...
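The idea behind rewriting can be illustrated with a simplified sketch: a prefix query is expanded against the sorted terms of a field into a set of cheap term queries. This is a conceptual model only, not Lucene's actual implementation; the helper name, the name field, and the term list are assumptions. The last dictionary shows a request body that sets the rewrite parameter, assuming the constant_score_boolean value documented for 5.x.

```python
from bisect import bisect_left

def rewrite_prefix(field, prefix, sorted_terms):
    """Expand a prefix into a bool query of term queries (conceptual model)."""
    start = bisect_left(sorted_terms, prefix)  # first candidate term
    matching = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # terms are sorted, so no later term can match
        matching.append(term)
    return {"bool": {"should": [{"term": {field: t}} for t in matching]}}

# Hypothetical terms indexed in the `name` field, in sorted order.
indexed_terms = ["jack", "jane", "joe", "john", "kate"]
rewritten = rewrite_prefix("name", "j", indexed_terms)
print(rewritten)

# Request body asking for a specific rewrite method (assumed 5.x syntax).
request_body = {
    "query": {
        "prefix": {"name": {"value": "j", "rewrite": "constant_score_boolean"}}
    }
}
```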

Query templates


As an application grows, its environment is likely to become more and more complicated. Your organization probably has developers who specialize in particular layers of the application; for example, at least one frontend designer and an engineer responsible for the database layer. Dividing development into several modules is convenient because different parts of the application can be worked on in parallel, without constant synchronization between individuals and the whole team. Of course, this is a book about search, not project management, so let's stick to that topic. In general, it would be useful, at least sometimes, to be able to extract all the queries generated by the application and hand them to a search engineer, who can optimize them in terms of both performance and relevance. In such a case, the application developers would only have to pass the query...
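The separation described above can be sketched as follows: the search engineer owns a query template with placeholders, and the application supplies only parameter values. The render function below is a deliberately tiny stand-in that handles plain {{name}} placeholders and is not a Mustache implementation; the template, field names, and parameters are hypothetical. The request body uses the inline key, which is our assumption about the 5.x search template API (later releases renamed it to source).

```python
import json
import re

def render(template, params):
    """Replace each {{name}} placeholder with the matching parameter value."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(params[m.group(1)]), template)

# Maintained by the search engineer.
template = '{"query": {"match": {"title": "{{title}}"}}, "size": {{size}}}'

# Supplied by the application.
params = {"title": "crime", "size": 5}

# What the engine would execute after substitution.
rendered = json.loads(render(template, params))
print(json.dumps(rendered, indent=2))

# Body the application would send to the search template endpoint (assumed shape).
request_body = {"inline": template, "params": params}
```

With this split, the template can be tuned for performance and relevance without touching application code, as long as the parameter names stay the same.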

Summary


In this chapter, we explained scoring and Lucene's new default similarity algorithm, BM25, and how it differs from TF-IDF, the previous default ranking algorithm in Apache Lucene. We also discussed precision and recall, the fundamentals of search relevancy.

After that, we discussed Elasticsearch Query DSL in detail and covered the important queries with their use cases. We also saw the new bool query syntax and how one can use filters within the query context of the bool query. The chapter also covered a detailed discussion about using query rewrites and using search templates along with the Mustache template engine.

In the next chapter, you will learn about query rescoring and how search works in multi-match scenarios, such as cross-field matching and phrase matching. We will also cover various ways to use scripting in Elasticsearch.
