Improving the User Search Experience

In this article by Hüseyin Akdoğan, author of the book ElasticSearch Indexing, we will examine the Elasticsearch Suggest API, which corrects users' spelling mistakes, and we will look closely at the various functionalities provided by Elasticsearch to improve the relevancy of search results. By the end of this article, we will have covered:

  • How to correct users' spelling mistakes
  • How to use the term suggester
  • How to use the phrase suggester
  • How to provide autocomplete functionality for the user
  • How to use boosting
  • How to use synonyms


Correction of users’ spelling mistakes

Typos and spelling mistakes occur for many reasons. Correcting them is therefore an integral part of a good search experience. When you search for a phrase that is close to another, frequently searched phrase, you may see a did you mean phrase, which helps correct users' spelling mistakes; search engines use this form to improve the user search experience. For example, this is what Google shows us when we type in threat safe instead of thread safe. Take a look at the following screenshot:

Elasticsearch provides this functionality through the Suggest API. In this section, we will look at how to use the Suggest API, both in simple use case scenarios and with its basic configuration settings.

Suggesters

The Suggest API suggests similar terms based on the text that you provide, using a suggester. Elasticsearch allows us to use three suggesters that provide three different functionalities. These are term, phrase, and completion. The term and phrase suggesters allow us to correct spelling mistakes. The completion suggester provides the autocomplete functionality. A suggestion request can be used in two ways:

  • Using the REST _suggest endpoint
  • Defined alongside the query part of a _search request

Now let's examine how we can use these formats.

Using the _suggest REST endpoint

When using the _suggest REST endpoint, you must provide the text for suggestions and the type of suggester to use. The endpoint then provides suggestions that are similar to the provided text. The following is an example of the _suggest REST endpoint. We would like to get suggestions for the word jama. Of course, we've misspelled it on purpose to understand the suggester's working logic:

curl -XGET localhost:9200/my_index/_suggest?pretty -d '{
  "my_suggestion" : {
    "text" : "jama",
    "term" : {
      "field" : "_all"
    }
  }
}'

In the preceding example, first we specified a name for the suggestion request. In this example, it is my_suggestion. Then we specified the text that we want to suggest, to be returned by using the text parameter. Afterward, we added the suggester type. Here, a term suggester is used. The term suggester object contains its configuration, and the field property defines the field that we want to use for suggestions. In this example, we specified that we wanted to use the _all field. Now let's look at the example response:

{
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "my_suggestion": [
      {
         "text": "jama",
         "offset": 0,
         "length": 4,
         "options": [
            {
               "text": "java",
               "score": 0.75,
               "freq": 415
            },
            {
               "text": "jaka",
               "score": 0.75,
               "freq": 109
            },
            {
               "text": "jakas",
               "score": 0.5,
               "freq": 37
            },
            {
               "text": "j2me",
               "score": 0.5,
               "freq": 26
            },
            {
               "text": "jakao",
               "score": 0.5,
               "freq": 13
            }
         ]
      }
   ]
}

As you can see in the preceding response, the output returns a list of suggestions for the term we provided in the text parameter of our my_suggestion section. The term suggester returns an array of possible suggestions with additional information for each term. Looking at the data returned for the term jama, we can see the options array that contains the suggestions.

In other words, each entry in this array is a suggestion for the provided term. If Elasticsearch does not find any suggestions for the provided term, the options field will be empty. The properties of each object in the options array returned by Elasticsearch, and their meanings, are as follows:

  • text: The suggestion text for the term provided by the user.
  • score: The score of the suggestion. The score indicates how close the suggestion is to the provided term; a higher score means a better suggestion. Note that the terms java and jaka received the highest score in the preceding response.
  • freq: The frequency of the suggestion. The frequency indicates how many times the term appears in the documents of the index. A high frequency means that more documents contain the suggested term in their fields and that the suggested term is more likely to be an appropriate suggestion for users. Note that the term java received the highest frequency value in the preceding response.
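To see how these two fields interact, the following short Python sketch ranks a handful of hypothetical suggestions the way the default score-based sort does (score first, then frequency, then the term itself); the values are modeled on the response above, not produced by Elasticsearch:

```python
# Hypothetical suggestion objects modeled on the preceding response.
options = [
    {"text": "jaka", "score": 0.75, "freq": 109},
    {"text": "j2me", "score": 0.5, "freq": 26},
    {"text": "java", "score": 0.75, "freq": 415},
    {"text": "jakas", "score": 0.5, "freq": 37},
]

# Sort descending by score, then descending by frequency, then by the term.
ranked = sorted(options, key=lambda o: (-o["score"], -o["freq"], o["text"]))
print([o["text"] for o in ranked])
```

java and jaka tie on score, so the higher-frequency java comes first, matching the ordering seen in the response.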

In addition, keep in mind that you can send more than one suggestion at a time by adding multiple suggestion names. For example, in addition to the term jama, we can also ask for a suggestion for rumy (of course we have again made a misspelling on purpose) as shown here:

curl -XGET localhost:9200/my_index/_suggest?pretty -d '{
  "first_suggestion" : {
    "text" : "jama",
    "term" : {
      "field" : "_all"
    }
  },
  "second_suggestion" : {
    "text" : "rumy",
    "term" : {
      "field" : "_all"
    }
  }
}'

Suggest object inclusion in the query

A suggest request can be defined alongside the query part of the _search request as follows:

curl -XGET localhost:9200/my_index/_search?pretty -d '{
  "query": {
    "match": {
      "description": "java"
    }
  },
  "suggest" : {
    "first_suggestion" : {
      "text" : "j2se",
      "term" : {
        "field" : "_all"
      }
    }
  }
}'

Unlike when the _suggest REST endpoint is used, when we include a suggestion request in a query, documents are returned along with the suggestions, even if we do not specify a query (in that case, Elasticsearch executes a match_all query). At this point, it is important to know that the returned suggestions are independent of the results returned by the specified query.

As we mentioned at the beginning of the Suggesters section, Elasticsearch allows us to use three suggesters. Now that we know how to use a suggestion request with the REST _suggest endpoint and as part of a search request, let's examine these three suggesters.

Term suggester

The term suggester suggests terms based on the edit distance. The edit distance is the number of characters that would need to be changed to make two terms match. A term with a lower edit distance is considered a better match than a term with a higher one. Consider the case of jama returning java that we previously examined: to change the term jama into java, we need to change the letter m to v, which means an edit distance of 1. The text provided for suggestion is analyzed before terms are suggested, and the suggested terms are returned relative to the analyzed suggest text.
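The edit distance described above can be computed with the classic Levenshtein algorithm. The following minimal Python sketch (not Elasticsearch's actual implementation, which uses optimized Lucene automata) illustrates the idea:

```python
def edit_distance(a, b):
    """Levenshtein distance: the number of single-character insertions,
    deletions, or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(edit_distance("jama", "java"))  # 1: change m to v
```

With a distance of 1, java is a strong candidate correction for jama, which is exactly why it scored highest in the earlier response.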

The term suggester does not take the query into account even when it is a part of a request.

Configuring the term suggester

Elasticsearch provides many configuration properties to configure the term suggester in order to suit our needs. Now we will talk about these configuration properties.

Common suggest options

The following options can be used for all the suggesters. The available options are:

  • text: The suggest text, that is, the text we want to receive suggestions for. This option is required and can be set globally or per suggestion.
  • field: The field option determines which field to use to fetch the suggestions. It is another required option and it can be set globally or as per the suggestion.
  • analyzer: This option's value must be the name of an analyzer that will be used to analyze the text provided in the text parameter. If no value is set, Elasticsearch will use the search analyzer of the suggest field.
  • size: This option defines the maximum number of suggestions that need to be returned as per the suggest text token. The default value is 5.
  • sort: This option allows us to specify how suggestions are sorted in the result returned by Elasticsearch. There are two values available—score and frequency. The default value is score. When the score value is used, the suggestions will be sorted based on the score first, then the frequency, and then the term itself. If the frequency is used, the suggestions will be sorted by frequency first, then by the similarity score and then by the term itself.
  • suggest_mode: This option allows us to control which suggestions will be included in the Elasticsearch response. There are three values available: missing, popular, and always. The default value is missing. When missing is used, Elasticsearch will generate suggestions for a term provided in the text parameter only if that term does not exist in the index. When popular is used, Elasticsearch will only suggest terms that exist in more documents than the original term. Finally, when always is used, Elasticsearch will suggest any matching suggestions for each of the words in the text parameter.

Other and additional term suggester options

In addition to the common suggest options, there are additional options we can use for the term suggester. These options are:

  • lowercase_terms: When this option is set to true, Elasticsearch will make all suggest terms lowercase after analysis.
  • max_edits: This option defines the value of the maximum edit distance and can only take a value between 1 and 2. The default value is 2. When setting this value to 1, you can see fewer but better suggestions in the result.
  • prefix_length: This option allows us to set how many of the suggestion's prefix characters must match the prefix characters of the provided term. The default value is 1. Increasing this number improves spellcheck performance because spelling mistakes usually do not appear at the beginning of a word.
  • min_word_length: This option defines the minimum length of a suggestion that is to be returned. The default value is 4.
  • shard_size: This option defines the maximum number of suggestions that will be read from each individual shard. The default value is specified by the size parameter. The terms are partitioned among the shards (unless we have a single-shard index) because of the sharding process. Therefore, setting this option to a value higher than the size parameter can be useful for obtaining a more accurate document frequency.
  • max_inspections: This option is a factor that defines how many candidates Elasticsearch will look at in order to find terms on the shard level that can be used as suggestions. The default value is 5. The factor is used as a multiplier for the shard_size option. Setting a value higher than the default can improve accuracy, but it comes at a performance cost.
  • min_doc_freq: This option defines the lower limit on the number of documents a suggestion must appear in. For example, if you set this option to 2, a suggestion must appear in at least two documents in a given shard. Note that this value is counted per shard, not globally. The default value is 0, which means the option is not enabled. Setting it higher than 0 can improve the quality of the returned suggestions by only suggesting high-frequency terms. Values lower than 1 can be specified as a percentage; for example, 0.02 means 2%. Shard-level document frequencies are used for this option.
  • max_term_freq: This option defines the maximum number of documents that a suggest text token can appear in and still be included for spellchecking. Similar to the min_doc_freq parameter, it can either be a relative percentage (for example, 0.04 means 4%) or an absolute number. This value is a per-shard frequency. When a value higher than 1 is specified, a fractional value cannot be used. The default value is 0.01. If you define a higher value for this option, the overall performance of the spellchecker will be better. In addition, this option is very useful because it excludes high-frequency terms, which are usually correct, from being spellchecked. Shard-level document frequencies are used for this option.
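To make the suggest_mode values described above more concrete, here is a small Python sketch of their filtering semantics. This is an illustrative assumption based on the descriptions, not Elasticsearch's actual code:

```python
def filter_by_mode(original_freq, candidates, mode):
    """Sketch of suggest_mode semantics:
    'missing' - suggest only when the original term is absent from the index,
    'popular' - keep candidates occurring in more docs than the original,
    'always'  - keep every matching candidate."""
    if mode == "missing":
        return [] if original_freq > 0 else candidates
    if mode == "popular":
        return [c for c in candidates if c["freq"] > original_freq]
    return candidates  # "always"

candidates = [{"text": "java", "freq": 415}, {"text": "jaka", "freq": 109}]
print(filter_by_mode(0, candidates, "missing"))   # original absent: all kept
print(filter_by_mode(200, candidates, "popular")) # only java exceeds freq 200
```

This is why popular mode is a good default for catching misspellings that happen to exist in the index: a typo usually occurs in far fewer documents than its correct spelling.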

The phrase suggester

The phrase suggester is an extended version of the term suggester. It uses n-gram language models to calculate how good the suggestion is and selects entire corrected phrases instead of individual weighted tokens. This means that whole phrases will be returned instead of individual terms. The n-gram approach gets a contiguous sequence of N terms from a given text. In other words, it divides terms in the string into grams. For example, if we would like to divide the word elasticsearch into bi-grams, it would look like this (when a two letter n-gram is used): el la as st ti ic cs se ea ar rc ch.

If you want more information about the n-gram language models, please see http://en.wikipedia.org/wiki/Language_model#N-gram_models.
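The bi-gram division shown above can be reproduced with a few lines of Python; this sliding-window sketch is only an illustration of the n-gram idea, not how Elasticsearch tokenizes internally:

```python
def ngrams(text, n=2):
    """Slide a window of n characters across the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(ngrams("elasticsearch"))
# ['el', 'la', 'as', 'st', 'ti', 'ic', 'cs', 'se', 'ea', 'ar', 'rc', 'ch']
```

The language model then scores candidate phrases by how plausible their gram sequences are, which is what lets the phrase suggester correct whole phrases rather than isolated terms.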

The best way to describe the phrase suggester is with an example, so you can see it in action. For this reason, we need to create some test data. Let's start by indexing five simple documents:

curl -XPOST 'localhost:9200/my_index/article/1' -d '{"title": "Introduction to ElasticSearch Data Analytics"}'
curl -XPOST 'localhost:9200/my_index/article/2' -d '{"title": "Big Data search and analysis by ElasticSearch"}'
curl -XPOST 'localhost:9200/my_index/article/3' -d '{"title": "Real-time Data Analytics with Elasticsearch"}'
curl -XPOST 'localhost:9200/my_index/article/4' -d '{"title": "Data Mining with ElasticSearch Data Analytics"}'
curl -XPOST 'localhost:9200/my_index/article/5' -d '{"title": "Elasticsearch Analytics with Kibana"}'
Okay, now let's see how to run a phrase suggester request:
curl -XPOST localhost:9200/my_index/_search?pretty -d '{
  "size": 0,
  "suggest": {
    "text": "elasticsarch data analytis",
    "phrase_suggestion": {
      "phrase": {
        "field": "title"
      }
    }
  }
}'

When we examine the preceding command, we can see that it is not very different from the command that we ran for the term suggester except that we specified the phrase type instead of the term type. The response to the preceding command is as follows:

{
   "took": 15,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 5,
      "max_score": 0,
      "hits": []
   },
   "suggest": {
      "phrase_suggestion": [
         {
            "text": "elasticsarch data analytis",
            "offset": 0,
            "length": 26,
            "options": [
               {
                  "text": "elasticsearch data analytics",
                  "score": 0.114973485
               },
               {
                  "text": "elasticsearch data analytis",
                  "score": 0.08818061
               },
               {
                  "text": "elasticsarch data analytics",
                  "score": 0.08641694
               },
               {
                  "text": "elasticsearch data analysis",
                  "score": 0.070414856
               }
            ]
         }
      ]
   }
}

As you can see, when the phrase suggester is used, Elasticsearch returns whole phrases instead of a single word/term for each term of the text field. The returned array includes the most likely corrected spelling suggestions, sorted by score. In this case, we first received the expected correction, elasticsearch data analytics, while the second correction is relatively less successful in that only one of the two errors is corrected.

Note that the request is executed with the max_errors parameter even though we did not specify it explicitly. This parameter defines how many misspelled terms a returned correction may still contain. Its default value is 1.0, which is why the returned array includes suggestions that still contain a misspelled term. Now let's look at what parameters of the phrase suggester are available for usage.

Configuring the phrase suggester

As mentioned earlier, the phrase suggester is an extension of the term suggester. This means there is an inheritance relationship between the two, and the phrase suggester has all the features of the term suggester. Therefore, the phrase suggester can also make use of the common configuration options provided by the term suggester (refer to the Common suggest options section in this article). In addition to these features, the phrase suggester exposes the following basic options:

  • field: This option determines which field to use to fetch the suggestions that we use to perform n-gram lookups for the language model. It is a required option.
  • gram_size: This option defines the maximum size of the n-grams in the field that is specified by the field option. If the specified field does not contain n-grams, this option should be set to 1 or be omitted. This behavior is recommended because Elasticsearch will try to detect the gram size by itself when this option is not set.
  • real_word_error_likelihood: This option defines the likelihood of a term being misspelled even if it exists in the index. The default value is 0.95, which tells Elasticsearch that 5% of all the terms that exist in its index are misspelled. Note that a low value will result in more terms being treated as misspelled even though they may be correct.
  • confidence: This option defines a threshold value for the suggestion candidates that will be included in the result. For example, when the confidence value is 1.0, Elasticsearch will only return suggestions that score higher than the input phrase. If it is set to 0.0, Elasticsearch will return all the suggestions regardless of their scores, limited only by the size parameter. The default value is 1.0.
  • max_errors: This option defines the maximum percentage of terms that can be misspelled in order to create a correction. It accepts either an integer number or a float value in the range between 0 and 1, which is treated as a percentage. If we specify a float value, it defines the fraction of terms that may be erroneous; if we specify an integer, Elasticsearch treats it as the maximum number of misspelled terms. The default value is 1.0, which means that a correction may contain at most one misspelled term. Setting this option too high can negatively affect performance.
  • separator: This option defines the separator that will be used to divide terms in the bigram field. The whitespace character is used as a separator when this option is not set.
  • highlight: This option allows us to use suggestion highlighting. When it is configured, pre_tag and post_tag should be used to specify the prefix and postfix to apply. For example, if we would like to surround the suggestions with the <em> and </em> tags, we should set pre_tag to <em> and post_tag to </em>.
  • collate: This option allows us to check each suggestion against a specified query or filter to prune suggestions for which no matching documents exist in the index. The query or filter must be specified with this option, and it is run as a template query. The query or filter must contain the {{suggestion}} variable; the current suggestion is automatically made available in this variable. You can also specify your own template params. When the additional parameter called prune is set to true, each suggestion will carry an additional field called collate_match. The default value of prune is false.
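The integer-versus-fraction interpretation of max_errors described above can be sketched as follows. This is an assumption based on the description in this article, not Elasticsearch's internal code, and the helper name max_allowed_errors is hypothetical:

```python
import math

def max_allowed_errors(max_errors, num_terms):
    """Sketch: values >= 1 are an absolute count of misspelled terms;
    values between 0 and 1 are a fraction of the query's terms."""
    if max_errors >= 1:
        return int(max_errors)
    return max(1, math.floor(max_errors * num_terms))

print(max_allowed_errors(1.0, 3))  # default: at most 1 misspelled term
print(max_allowed_errors(0.5, 4))  # 50% of a 4-term phrase -> 2 terms
```

This explains the behavior seen earlier: with the default of 1.0, corrections fixing only one of the two errors in elasticsarch data analytis were still returned.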

Now let's look at an example using some of the parameters mentioned earlier. For example, if you want to use highlighting, the command would look as follows:

curl -XPOST localhost:9200/my_index/_search -d '{
  "size": 0,
  "suggest": {
    "text": "elasticsarch data analytis",
    "phrase_suggestion": {
      "phrase": {
        "field": "title",
        "real_word_error_likelihood": 0.95,
        "max_errors": 0.5,
        "highlight": {
          "pre_tag": "<em>",
          "post_tag": "</em>"
        },
        "collate": {
          "prune": true,
          "query": {
            "match": {
              "{{field}}": "{{suggestion}}"
            }
          },
          "params": {
            "field": "title"
          }
        }
      }
    }
  }
}'

The result returned by Elasticsearch for the preceding query would be as follows:

{
   "took": 17,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 5,
      "max_score": 0,
      "hits": []
   },
   "suggest": {
      "phrase_suggestion": [
         {
            "text": "elasticsarch data analytis",
            "offset": 0,
            "length": 26,
            "options": [
               {
                  "text": "elasticsearch data analytics",
                  "highlighted": "<em>elasticsearch</em> data <em>analytics</em>",
                  "score": 0.114973485,
                  "collate_match": true
               },
               {
                  "text": "elasticsearch data analytis",
                  "highlighted": "<em>elasticsearch</em> data analytis",
                  "score": 0.08818061,
                  "collate_match": true
               },
               {
                  "text": "elasticsarch data analytics",
                  "highlighted": "elasticsarch data <em>analytics</em>",
                  "score": 0.08641694,
                  "collate_match": true
               },
               {
                  "text": "elasticsearch data analysis",
                  "highlighted": "<em>elasticsearch</em> data <em>analysis</em>",
                  "score": 0.070414856,
                  "collate_match": true
               }
            ]
         }
      ]
   }
}

As expected, the suggestions were highlighted nicely.

The completion suggester

Unlike the other suggesters, the completion suggester provides basic autocomplete functionality instead of spelling correction. It is a so-called prefix suggester, based on the Finite State Transducer (FST) data structure. In this structure, more than one output value can be stored for each input string value.

If you want more information on the FST data structure, please refer to http://en.wikipedia.org/wiki/Finite_state_transducer.

Prefix suggestions are faster than the other suggestion types. They are stored in an FST-like data structure as part of your index at index time. For this reason, the completion suggester allows really fast loading and execution of suggestions because it does not perform any calculation at query time.
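To get an intuition for why prefix lookups are so cheap, consider this simplified stand-in: a sorted list of stored inputs, where every entry sharing a prefix can be found with two binary searches. The real FST is far more compact and also carries the outputs and payloads, so this is only a conceptual sketch:

```python
import bisect

# Stand-in for the FST's stored inputs: a sorted list of strings.
inputs = sorted(["andrei", "arsenyevich", "tarkovsky", "the godfather"])

def prefix_matches(prefix):
    """All stored inputs starting with the given prefix, via binary search."""
    start = bisect.bisect_left(inputs, prefix)
    end = bisect.bisect_left(inputs, prefix + "\uffff")
    return inputs[start:end]

print(prefix_matches("tar"))  # ['tarkovsky']
```

Because the sorted structure is built at index time, each query costs only a couple of O(log n) searches, with no per-query scoring work.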

Mapping the configuration for the completion suggester

In order to use this feature, we need to dedicate one field, whose type will be completion, and we have to specify a special mapping for it. Thus, the field stores the FST-like structure in the index. In order to illustrate how to use this suggester, let's create an index to search for movie directors with the autocomplete feature. Next to a director's name, we want to return the titles of the movies he/she directed in order to search for them with an additional query. We create the imdb index by running the following command:

curl -XPOST localhost:9200/imdb -d '{
 "mappings": {
  "director": {
   "properties": {
    "name": {
     "type": "string"
    },
    "completion_suggest": {
     "type": "completion",
     "analyzer": "simple",
     "search_analyzer": "simple",
     "payloads": true
    }
   }
  }
 }
}'
{"acknowledged":true}

Okay. Now we have an index that will contain a single type called director. We specified two fields for each document stored under this type: name and completion_suggest. The first field is the name of the director, and the second is the field we will use for the autocomplete function. Note that we defined the completion_suggest field with the completion type, which results in storing the FST-like structure in the index. The mapping of the completion suggester supports the following parameters:

  • type: This option is required and should be set to completion.
  • analyzer: This option defines the analyzer to use during indexing time. The default value is simple.
  • search_analyzer: This option defines the analyzer to use at query time. The default is the value of the analyzer option.
  • payloads: This option defines whether or not payloads are stored. The default value is false. When set to true, it allows you to return additional information along with the suggestion.
  • preserve_separators: This option defines whether or not the separators are taken into consideration. The default value is true. For example, when it is set to false, you could find a field starting with Real Madrid if you suggest for realm.
  • preserve_position_increments: This option defines whether or not the position increments are enabled. The default value is true. For example, when it is set to false, you could find a field starting with The Godfather, if you suggest g.
  • max_input_length: This option defines the limit for the length of a single input. The default value is 50 UTF-16 code points.

Indexing on completion field

We will now index a document describing Andrei Tarkovsky and we will provide some additional information about his movies. Let's look at the following code:

curl -XPOST localhost:9200/imdb/director/1 -d '{
 "name": "Andrei Tarkovsky",
 "completion_suggest": {
  "input": [ "andrei", "arsenyevich", "tarkovsky" ],
  "output": "Andrei Arsenyevich Tarkovsky",
  "payload": { "movies": [ "Ivan'\''s Childhood", "Andrei Rublev", "Solaris", "The Mirror", "Stalker", "Nostalgia", "The Sacrifice" ] }
 }
}'
{"_index":"imdb","_type":"director","_id":"1","_version":1,"created":true}

As you can see, we provided the input, output, and payload properties for the completion_suggest field. The following parameters are supported:

  • input: This field stores the input. It can be an array of strings or just a string. This field is required.
  • output: This field stores a string to return when a suggestion matches. This field is optional.
  • payload: This field stores an arbitrary JSON object used to return additional information about your document. It is optional.
  • weight: This field stores a positive integer, or a string containing a positive integer, that defines a weight for the document. It allows you to rank your suggestions and is optional.

Get suggestions

If we would like to find documents that have directors starting with tar, we would run the following command:

curl -XGET localhost:9200/imdb/_suggest?pretty -d '{
 "directorAutocomplete": {
  "text": "tar",
  "completion": {
   "field": "completion_suggest"
  }
 }
}'

The result returned by Elasticsearch for the preceding query looks as follows:

{
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "directorAutocomplete": [
      {
         "text": "tar",
         "offset": 0,
         "length": 3,
         "options": [
            {
               "text": "Andrei Arsenyevich Tarkovsky",
               "score": 1,
               "payload": {
                  "movies": [
                     "Ivan's Childhood",
                     "Andrei Rublev",
                     "Solaris",
                     "The Mirror",
                     "Stalker",
                     "Nostalgia",
                     "The Sacrifice"
                  ]
               }
            }
         ]
      }
   ]
}

As you can see, when we searched for the phrase tar, the document about Andrei Tarkovsky was returned to us with the payload information about his movies, because we indexed the terms andrei, arsenyevich, and tarkovsky as input values in the document's completion field. This is why the phrase tar matched tarkovsky, and the text indexed as the output value (that is, Andrei Arsenyevich Tarkovsky) was returned to us along with the payload field.

Improving the relevancy of search results

Although Elasticsearch is also a data analysis tool, it is generally used for searching. In this respect, improving query relevance is an important issue. Searching means querying and scoring, so relevance is a very important part of querying in Apache Lucene as well. We can use the rescoring mechanism to improve query relevance. In addition to the document scoring capabilities of the Apache Lucene library, Elasticsearch provides different query types to manipulate the scores of the results returned by our queries. In this section, you will find several tips on this issue.

Boosting the query

The boosting query allows us to effectively demote results that match a given query. This feature is very useful because it lets us push less relevant records of the result set to the back. For example, say we have an index that stores the skills of developers and we're looking for developers who know the Java language. We use a query such as the following for this case:

curl -XGET localhost:9200/my_index/_search?pretty -d '{
  "fields": ["age", "skills", "education_status"],
  "query": {
    "match": {
      "skills": "java"
    }
  }
}'
...
         {
            "_index": "my_index",
            "_type": "talent",
            "_id": "AVERYloLvXHAFW5Vn9ct",
            "_score": 0.30685282,
            "fields": {
               "skills": [
                  "c++",
                  "ruby",
                  "java",
                  "scala",
                  "python"
               ],
               "education_status": [
                  "graduated"
               ],
               "age": [
                  26
               ]
            }
         },
         {
            "_index": "my_index",
            "_type": "talent",
            "_id": "AVERZkNpvXHAFW5Vn9jo",
            "_score": 0.30685282,
            "fields": {
               "skills": [
                  "java",
                  "jsf",
                  "wicket",
                  "scala",
                  "python",
                  "play",
                  "spring"
               ],
               "education_status": [
                  "student"
               ],
               "age": [
                  22
               ]
            }
         },
         {
            "_index": "my_index",
            "_type": "talent",
            "_id": "AVERXyjCvXHAFW5Vn9W9",
            "_score": 0.30685282,
            "fields": {
               "skills": [
                  "c",
                  "java",
                  "spring",
                  "spring mvc",
                  "node.js"
               ],
               "education_status": [
                  "graduated"
               ],
               "age": [
                  27
               ]
            }
         }

What can we do if some of the returned documents matter less to us than others, and how can we surface the most relevant records first while browsing through the data? For example, say we want to prioritize students. Reducing the score of documents that contain unwanted terms can be a solution. You can specify such negative rules in a boosting query. In this case, the documents containing unwanted terms are still returned, but their overall scores are reduced. To send such a query to Elasticsearch, we will use the following command:

curl -XGET localhost:9200/my_index/_search?pretty -d '{
  "query": {
    "boosting": {
      "positive": {
        "match": {
          "skills": "java"
        }
      },
      "negative": {
        "match": {
          "education_status": "graduated"
        }
      },
      "negative_boost": 0.2
    }
  }
}'
...
         {
            "_index": "my_index",
            "_type": "talent",
            "_id": "AVERZkNpvXHAFW5Vn9jo",
            "_score": 0.30685282,
            "_source": {
               "name": "Hüseyin Akdoğan",
               "age": 22,
               "skills": [
                  "java",
                  "jsf",
                  "wicket",
                  "scala",
                  "python",
                  "play",
                  "spring"
               ],
               "education_status": "student"
            }
         },
         {
            "_index": "my_index",
            "_type": "talent",
            "_id": "AVERYloLvXHAFW5Vn9ct",
            "_score": 0.061370563,
            "_source": {
               "name": "Hüseyin Akdoğan",
               "age": 26,
               "skills": [
                  "c++",
                  "ruby",
                  "java",
                  "scala",
                  "python"
               ],
               "education_status": "graduated"
            }
         },
         {
            "_index": "my_index",
            "_type": "talent",
            "_id": "AVERXyjCvXHAFW5Vn9W9",
            "_score": 0.061370563,
            "_source": {
               "name": "Hüseyin Akdoğan",
               "age": 27,
               "skills": [
                  "c",
                  "java",
                  "spring",
                  "spring mvc",
                  "node.js"
               ],
               "education_status": "graduated"
            }
         }

As you can see, the score of the document whose education_status field value is student is the same as in the previous query result, but the scores of the last two documents have been reduced by 80%. This reduction is determined by the value of negative_boost, which we set to 0.2 in the preceding command.
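The arithmetic behind negative_boost is simple: a document matching the negative query keeps its positive score multiplied by the factor. Here is a minimal sketch of that rule (the helper function below is our own illustration, not part of any Elasticsearch client API):

```python
def boosting_score(positive_score, matches_negative, negative_boost=0.2):
    """Mimic the boosting query's scoring rule: documents matching the
    negative part keep their positive score scaled by negative_boost;
    other documents keep their positive score unchanged."""
    return positive_score * negative_boost if matches_negative else positive_score

# The student document keeps its score; the graduated ones lose 80% of it.
print(boosting_score(0.30685282, matches_negative=False))
print(boosting_score(0.30685282, matches_negative=True))
```

With negative_boost set to 0.2, the demoted score 0.30685282 × 0.2 matches the 0.061370563 we saw in the response (up to floating-point rounding).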

Bool query

The bool query allows us to combine nested queries with Boolean logic. It provides a should occurrence type whose clauses are not required to match (of course, this behavior can be changed by setting the minimum_should_match parameter), but each matching should clause increases the document score. This feature is very useful when you want to move some results within the result set to the forefront. For example, suppose we have an index that stores technical articles and we're looking for articles written about Docker. We use a query like the following for this:

curl -XGET localhost:9200/my_index/_search -d '{
  "query": {
    "multi_match": {
      "query": "docker",
      "fields": ["_all"]
    }
  }
}'
...
         {
            "_index": "my_index",
            "_type": "article",
            "_id": "AVETmMSTOCXTx0WbQQh1",
            "_score": 0.13005449,
            "_source": {
               "title": "9 Open Source DevOps Tools We Love",
               "content": "We have configured Jenkins to build code, create Docker containers..."
            }
         },
         {
            "_index": "my_index",
            "_type": "article",
            "_id": "AVETl_kKOCXTx0WbQQga",
            "_score": 0.111475274,
            "_source": {
               "title": "Using Docker Volume Plugins with Amazon ECS-Optimized AMI",
               "content": "Amazon EC2 Container Service (ECS) is a highly scalable, high performance container management services..."
            }
         }
...

As you can see, the first document seems less relevant to docker than the second document. In this case, we can use a should clause together with the boost parameter to improve the relevancy of our search results. The boost parameter allows us to increase the weight of the given fields; it tells Elasticsearch that some fields are more important than others when performing term matching. If the title field contains the term that we're looking for, the document is relevant, and this assessment is not wrong. Therefore, in our example, the important field is title. We can run the following command as another example:

curl -XGET localhost:9200/my_index/_search?pretty -d '{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "_all": "docker"
          }
        }
      ],
      "should": [
        {
          "match": {
            "title": {
              "query": "docker",
              "boost": 2
            }
          }
        }
      ]
    }
  }
}'

Okay, let's now look at the example response:

...
         {
            "_index": "my_index",
            "_type": "article",
            "_id": "AVETl_kKOCXTx0WbQQga",
            "_score": 0.33130926,
            "_source": {
               "title": "Using Docker Volume Plugins with Amazon ECS-Optimized AMI",
               "content": "Amazon EC2 Container Service (ECS) is a highly scalable, high performance container management services..."
            }
         },
         {
            "_index": "my_index",
            "_type": "article",
            "_id": "AVETmMSTOCXTx0WbQQh1",
            "_score": 0.018529123,
            "_source": {
               "title": "9 Open Source DevOps Tools We Love",
               "content": "We have configured Jenkins to build code, create Docker containers..."
            }
         }
...

As you can see, thanks to the should clause and the boost parameter, the more relevant document is now returned first.
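To see why boosting the title matters, consider a deliberately naive scorer (our own toy, nothing like Lucene's actual TF/IDF scoring): each field containing the query term contributes its boost weight, so raising the weight of title is enough to reorder the two articles from the example.

```python
def toy_score(doc, term, boosts=None):
    """Toy relevance: every field whose text contains the term adds its
    boost (default 1.0). This only illustrates how per-field boosts
    reorder hits; real Elasticsearch scoring is far more involved."""
    boosts = boosts or {}
    return sum(boosts.get(field, 1.0)
               for field, text in doc.items() if term in text.lower())

devops = {"title": "9 Open Source DevOps Tools We Love",
          "content": "We have configured Jenkins to build code, create Docker containers..."}
volumes = {"title": "Using Docker Volume Plugins with Amazon ECS-Optimized AMI",
           "content": "Amazon EC2 Container Service (ECS) is a highly scalable..."}

# Unboosted, both articles match "docker" exactly once and tie.
# With a weight of 2 on title, the article with Docker in its title wins.
ranked = sorted([devops, volumes],
                key=lambda d: toy_score(d, "docker", {"title": 2.0}),
                reverse=True)
print(ranked[0]["title"])
```

The design intuition carries over directly: a match in a boosted field contributes more to the score, so documents matching in title rise to the top.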

Synonyms

TR relates to Turkey, and a search for Jeffrey Jacob Abrams also relates to J.J. Abrams. The simpler and more subtle such variations are, the easier it is for human beings to notice the similarity; machines, however, need assistance here. Synonyms allow us to ensure that documents containing terms with the same or similar meanings are found. In other words, they broaden the scope of what is considered a matching document. Now let's examine the following example:

curl -XPUT localhost:9200/travel -d '{
  "settings": {
    "analysis": {
      "filter": {
        "tr_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "tr,turkey"
          ]
        }
      },
      "analyzer": {
        "tr_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "tr_synonym_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "city": {
      "properties": {
        "city": {
          "type": "string", "analyzer": "tr_synonyms"
        },
        "description": {
          "type": "string", "analyzer": "tr_synonyms"
        }
      }
    }
  }
}'

We created a travel index using the tr_synonyms analyzer. It is configured with the synonym token filter whose name is tr_synonym_filter. The tr_synonym_filter handles synonyms during the analysis process. Its synonyms parameter accepts an array of synonyms that were provided by us. The only element of the array says that tr is a synonym of turkey and vice versa. Now let's add a document to the index:

curl -XPOST localhost:9200/travel/city -d '{
  "city": "Istanbul",
  "description": "Istanbul is the most populous city in Turkey."
}'
{"_index":"travel","_type":"city","_id":"AVEXOA_xXNtV9WrYCpuZ","_version":1,"created":true}
Now, let's search for the tr phrase on the travel index:
curl -XGET localhost:9200/travel/_search?pretty -d '{
  "query": {
    "match": {
      "description": "tr"
    }
  }
}'
{
   "took": 12,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.13561106,
      "hits": [
         {
            "_index": "travel",
            "_type": "city",
            "_id": "AVEXOA_xXNtV9WrYCpuZ",
            "_score": 0.13561106,
            "_source": {
               "city": "Istanbul",
               "description": "Istanbul is the most populous city in Turkey."
            }
         }
      ]
   }
}

As you can see, the document that we're looking for was returned because tr_synonym_filter handles synonyms using the synonym list that we defined.
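Conceptually, the filter expands tokens at analysis time. A rough sketch of what the tr_synonyms chain does to the indexed text (our own simplification, not how Lucene actually implements synonym graphs):

```python
SYNONYM_GROUPS = [{"tr", "turkey"}]  # mirrors the "tr,turkey" rule

def analyze(text):
    """Simplified tr_synonyms chain: split on whitespace, strip trailing
    punctuation, lowercase, then emit each token plus every synonym in
    its group."""
    tokens = [t.strip(".,").lower() for t in text.split()]
    expanded = []
    for token in tokens:
        expanded.append(token)
        for group in SYNONYM_GROUPS:
            if token in group:
                expanded.extend(sorted(group - {token}))
    return expanded

# "Turkey" is indexed alongside "tr", so a query for "tr" matches.
terms = analyze("Istanbul is the most populous city in Turkey.")
print("tr" in terms)
```

Because "turkey" is expanded to also emit "tr" into the index, the earlier match query for tr finds the document even though the original text never contains that token.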

Be careful about the _all field

Elasticsearch allows you to search across all the fields of a document. This facility is provided by the _all field, which includes the text of one or more other fields within the indexed document, concatenated into one big string. This feature is very useful when we want to run a full-text search. However, due to the structure of this field, we may not get the expected results when searching on it. For example, let's change the query we used in our previous example to run on the _all field:

curl -XGET localhost:9200/travel/_search?pretty -d '{
  "query": {
    "match": {
      "_all": "tr"
    }
  }
}'
{
   "took": 15,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 0,
      "max_score": null,
      "hits": []
   }
}

As you can see, no document was returned to us in the query results. This is because the _all field combines the original values from each field of the document as a string. In our previous example, the _all field only included these terms: [istanbul, is, the, most, populous, city, in, turkey].

So, the synonyms did not appear in this field. Another important point to note is that the _all field is of the string type. This means that the values of fields of other types are stored as strings. For example, if we have a date field whose value is 2002-11-03 00:00:00 UTC, the _all field will contain the terms [2002, 11, 03].
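The behavior is easier to see with a sketch of how _all is assembled (again our own simplification): the original string values of every field are concatenated into one big string and analyzed with the default analyzer, so the per-field tr_synonyms analyzer never runs on it.

```python
def build_all_terms(doc):
    """Sketch of the _all field: concatenate the original values of all
    fields into one big string, then tokenize and lowercase it with a
    default-analyzer-like chain (crucially, with no synonym expansion)."""
    big_string = " ".join(str(value) for value in doc.values())
    return [t.strip(".,").lower() for t in big_string.split()]

doc = {"city": "Istanbul",
       "description": "Istanbul is the most populous city in Turkey."}
terms = build_all_terms(doc)
print(terms)
```

The resulting term list contains turkey but not tr, which is exactly why the match query on _all returned no hits.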

Summary

In this article, we looked at the Suggest API and saw how we can use term, phrase, and completion suggesters with their configuration details. Then we looked at the various functionalities to improve the relevancy of search results provided by Elasticsearch. We looked at how we can broaden the scope of matching documents with the synonym facility. Finally, we tried to correctly understand the notion of the _all field in depth.
