Apache Solr High Performance — Save 50%
Boost the performance of Solr instances and troubleshoot real-time problems with this book and ebook
In this article, by Surendra Mohan, the author of Apache Solr High Performance, we will learn about the dismax query parser in detail. We will also look at what boost queries are.
The dismax query parser
Before we see how to boost our search using the dismax query parser, let us learn what the dismax query parser is and which features make it more sought-after than the Lucene query parser.
The Lucene query parser has one significant problem: it requires the query to be well formed, following syntax rules such as balanced quotes and parentheses. It does not account for the fact that end users might be laymen. Such users, unaware of these restrictions, might type anything as a query and are prone to end up with either an error or unexpected search results.
To tackle such situations, the dismax query parser came into play. It is named after Lucene's DisjunctionMaxQuery. It addresses the previously discussed issue and also incorporates a number of features that enhance search relevancy (that is, boosting or scoring).
Now, let us look at the features that the dismax query parser provides over the Lucene query parser:
- Search across multiple fields, each with its own boost
- Query syntax limited to the essentials
- Automatic boosting of phrases found in the search query
- Convenient query boosting parameters, usually used with function queries
- The ability to specify the minimum number of query words that must match
I assume you are familiar with the q parameter, with how the parser for user queries is set using the defType parameter, and with the usage of the qf, mm, and q.alt parameters. If not, I recommend that you refer to the DisMax query parser documentation at https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser.
Lucene DisjunctionMaxQuery provides the capability to search across multiple fields with different boosts.
Let us consider an example in which the query string is mohan. We may configure dismax to search three fields with different boosts, which makes it behave in a way very similar to DisjunctionMaxQuery. A roughly comparable Boolean query looks as follows:
fieldX:mohan^2.1 OR fieldY:mohan^1.4 OR fieldZ:mohan^0.3
This Boolean query is not quite equivalent to what dismax actually does; the difference lies in the scoring. For a Boolean query such as this one, the final score is the sum of the scores of the clauses, whereas DisjunctionMaxQuery takes the highest clause score as the final one. To understand this practically, let us assume for simplicity that each clause scores exactly its boost value, and compare the final scores of the two behaviors:
Fscore_boolean = 2.1 + 1.4 + 0.3 = 3.8
Fscore_disjunctionMaxQuery = 2.1 (the highest of the three)
Based on the preceding calculation, we can infer that the Boolean score is never lower than the DisjunctionMaxQuery score. Taking the maximum keeps a document from being rewarded repeatedly merely because the same keyword appears in several fields; hence, it tends to give better search relevancy when we are searching for the same keyword in multiple fields.
Now, we will look into another parameter, known as tie, that refines the search relevance even further. The value of the tie parameter ranges from 0 to 1, 0 being the default value. Raising this value above 0 begins to favor the documents that match in multiple fields over those that were boosted higher. At a value of 1, the score is very close to that of the Boolean query. Practically speaking, a small value such as 0.1 is usually an effective choice.
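The effect of tie can be sketched with a few lines of Python. This is only an illustration of how Lucene's DisjunctionMaxQuery combines per-field scores (the highest score plus tie times the sum of the others); the function names are made up for this example, and the per-field scores simply reuse the boost values 2.1, 1.4, and 0.3 from the query above, which real Solr scores would not equal.

```python
def boolean_score(clause_scores):
    """Sum of all clause scores, as a plain Boolean OR query combines them."""
    return sum(clause_scores)


def disjunction_max_score(clause_scores, tie=0.0):
    """Highest clause score plus tie times the sum of the remaining scores."""
    best = max(clause_scores)
    return best + tie * (sum(clause_scores) - best)


# Assumption for illustration: each field clause scores exactly its boost value.
field_scores = [2.1, 1.4, 0.3]

for tie in (0.0, 0.1, 1.0):
    print(f"tie={tie}: {disjunction_max_score(field_scores, tie):.2f}")
print(f"boolean: {boolean_score(field_scores):.2f}")
```

With tie at 0 only the best field counts, and with tie at 1 the result coincides with the Boolean sum; values in between let the other fields contribute a little.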
Let us assume that a user searches for Surendra Mohan. Solr interprets this as two different search keywords, and depending on how the request handler has been configured, either both terms or just one must be found in a document. In one of the matching documents, Surendra might be the name of an organization that has an employee named Mohan. Solr will find this document, and it might be of interest to the user because it contains both the terms the user typed. However, a document whose field contains the phrase Surendra Mohan exactly as typed is likely to be a closer match to what the user is actually looking for. In such scenarios, it is quite difficult to predict the relative scores of these documents, even though the results contain the relevant documents the user was looking for.
To tackle such situations and improve scoring, you might be tempted to quote the user's query automatically; however, this would omit the documents in which the words are not adjacent. Instead, dismax can add a phrased form of the user's query onto the entered query as an optional clause. The query Surendra Mohan is rewritten as follows:
+(Surendra Mohan) "Surendra Mohan"
In the rewritten query, + marks the entered query as mandatory, and the quoted phrase is the optional clause we have added. So, a document that contains the phrase Surendra Mohan not only matches that clause in the rewritten query, but also matches each of the terms individually (that is, Surendra and Mohan). Thus, in total, such a document matches three clauses.
Assume that there is another document in which this phrase doesn't match, but both terms appear individually, scattered through the text. In this case, only two of the clauses would match. As per Lucene's scoring algorithm, the coordination factor for the first document (which matched the complete phrase) would be higher, assuming that all the other factors remain the same.
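The clause counting described above can be sketched in Python. This is a toy model, not Solr code: the function names and the two sample documents are invented for illustration, and matching is done on lowercase whitespace-separated words rather than through a real analyzer.

```python
def rewrite_with_phrase(user_query):
    """Make the entered terms mandatory and append the optional phrase clause."""
    terms = " ".join(user_query.split())
    return f'+({terms}) "{terms}"'


def matched_clauses(user_query, document_text):
    """Count matching clauses: one per term found, plus one for the exact phrase."""
    terms = user_query.lower().split()
    words = document_text.lower().split()
    term_hits = sum(1 for term in terms if term in words)
    size = len(terms)
    phrase_hit = any(
        words[i:i + size] == terms for i in range(len(words) - size + 1)
    )
    return term_hits + (1 if phrase_hit else 0)


print(rewrite_with_phrase("Surendra Mohan"))
# Hypothetical documents: one with the phrase, one with scattered terms.
print(matched_clauses("Surendra Mohan", "music by Surendra Mohan"))
print(matched_clauses("Surendra Mohan", "Surendra Ltd employs Mohan"))
```

The document containing the adjacent words matches three clauses, while the one with scattered terms matches only two, which is the difference the coordination factor rewards.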
Configuring autophrase boosting
Autophrase boosting is not enabled by default. To use this feature, you have to set the pf parameter (phrase fields), whose syntax is identical to that of the qf parameter. To tune the pf value, it is recommended that you start with the same value as that of qf and then make the necessary adjustments.
There are a few reasons why the pf value might differ from qf. They are as follows:
- To use different boost factors so that the impact of phrase boosting isn't overwhelming.
- To omit fields that always hold a single term, for example, an identifier, because there is no point in searching such fields for phrases.
- To omit some of the fields that hold large amounts of text, in order to preserve search performance.
- To substitute a field with another that has the same data but is analyzed differently. You may use different text analysis techniques to achieve this, for example, Shingle or Common-grams. To learn more about text analysis techniques and their usage, I recommend that you refer to http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters.
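As a rough sketch of how qf and pf might be sent together, the following Python snippet builds the query string of a search request. The parameter names defType, q, qf, and pf are the dismax parameters discussed in this article; the field names title and body and all the boost values are assumptions made up for this example.

```python
from urllib.parse import urlencode

# Hypothetical fields and boosts; pf starts from the qf fields with its own boosts.
params = {
    "defType": "dismax",
    "q": "Surendra Mohan",
    "qf": "title^2.0 body^1.0",
    "pf": "title^1.5 body^0.5",
}

# urlencode percent-encodes the caret and turns spaces into plus signs.
query_string = urlencode(params)
print(query_string)
```

The resulting string would be appended to the URL of a search handler; the same values could equally be placed in the request handler defaults.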
Configuring the phrase slop
Before we learn how to configure the phrase slop, let us understand what it actually is. Slop stands for term proximity, and is primarily used to factor the distance between two or more terms into the relevancy calculation. As discussed earlier in this section, if the two terms Surendra and Mohan are adjacent to each other in a document, that document will have a better score for the search keyword Surendra Mohan compared to a document that contains the terms Surendra and Mohan spread individually throughout the text. When slop is used in conjunction with the OR operator, the relevancy of the documents returned in the search results is likely to improve. The slop syntax is a phrase (in double quotes) followed by a tilde (~) and a number, in the form "term1 term2"~N.
Dismax provides two parameters so that slop can be set automatically: qs for any phrase queries entered by the user, and ps for phrase boosting. If the slop is not specified, there is no slop and its value remains 0. The following is a sample configuration setting for slop:
<str name="qs">1</str>
<str name="ps">0</str>
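To get a feel for what a slop value permits, here is a simplified Python sketch. It only handles a two-term phrase whose terms appear in the typed order, and it treats slop as the number of extra words allowed between them; Lucene's real slop is a more general positional distance that also covers reordered terms. The function names and the sample texts are made up for illustration.

```python
def phrase_with_slop(phrase, slop):
    """Format a phrase query with slop: quoted phrase, tilde, number."""
    return f'"{phrase}"~{slop}'


def in_order_gap_matches(text, first, second, slop):
    """True if `second` follows `first` with at most `slop` words in between."""
    words = text.lower().split()
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(
        0 <= j - i - 1 <= slop
        for i in first_positions
        for j in second_positions
    )


print(phrase_with_slop("Surendra Mohan", 1))
# Hypothetical text with one word between the two terms.
print(in_order_gap_matches("Surendra Kumar Mohan", "Surendra", "Mohan", 0))
print(in_order_gap_matches("Surendra Kumar Mohan", "Surendra", "Mohan", 1))
```

With a slop of 0 the terms must be adjacent, so the sample text does not match; raising the slop to 1 tolerates the single intervening word.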
Boosting a partial phrase
You might come across a situation where you need to boost consecutive word pairs or even triples out of a phrase query. To do so, you need to use edismax and set pf2 and pf3 for word pairs and triples, respectively. The parameters pf2 and pf3 are defined in a manner identical to that of the pf parameter. For instance, consider the following query:
how who now cow
This query becomes:
+(how who now cow) "how who now cow" "how who" "who now" "now cow" "how who now" "who now cow"
This feature is unaffected by the ps parameter, because ps applies only to the boost on the entire phrase and has no impact on partial phrase boosting.
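The way word pairs and triples are derived from a query can be sketched in Python as follows. This only reproduces the shape of the rewritten query shown above; the function names are invented for this example, and the real parser additionally applies the field lists and boosts given in pf, pf2, and pf3.

```python
def word_shingles(query, size):
    """Return every run of `size` consecutive words in the query."""
    words = query.split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def rewrite_with_partial_phrases(query):
    """Mandatory terms, then the full phrase, then word pairs and triples."""
    clauses = [f"+({query})", f'"{query}"']
    clauses += [f'"{pair}"' for pair in word_shingles(query, 2)]
    clauses += [f'"{triple}"' for triple in word_shingles(query, 3)]
    return " ".join(clauses)


print(word_shingles("how who now cow", 2))
print(rewrite_with_partial_phrases("how who now cow"))
```

For a four-word query this yields three pairs and two triples, so the number of optional clauses grows with the query length.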
Moreover, you may expect better relevancy for longer queries; however, the longer the query, the slower its execution. To handle this situation and make the longer queries execute faster, you need to explore and use text analysis techniques such as Shingle or Common-grams.
Boost queries
Apart from the boosting techniques we discussed earlier, boost queries are another technique that can have a major impact on the score of a document. Implementing boost queries involves specifying one or more additional queries using the bq parameter (or a set of bq parameters) of the dismax query parser. Just like the autophrase boost, these queries get added to the user's query as optional clauses. Let us not forget that boosting only impacts the scores of the documents that already matched the user's query in the q parameter. So, for a document to achieve a higher score, it must also match a bq query.
To understand boost queries better and learn how to work with them, let us consider a realistic example of a music composition and a commerce product. We will primarily be concerned with the music type and composer fields, named wm_type and wm_composer, respectively. The wm_type field holds values such as Orchestral, Chamber, and Vocal, and the wm_composer field holds values such as Mohan, Webber, and so on.
We don't wish to sort the search results on these fields, because we want the natural scoring algorithm to keep ranking documents by their relevance to the user's query; however, we do want these fields to influence the score. For instance, let us assume that the music type Chamber is the most relevant one, whereas Vocal is the least relevant. Moreover, we assume that the composer Mohan is more relevant than Webber or others. Now, let us see how we can express this using the following boost query:
<str name="bq">wm_type:Chamber^2 (*:* -wm_type:Vocal)^2 wm_composer:Mohan^2</str>
Looking at the search results for any keyword entered by the user (for instance, Opera Simmy), we can infer that our boost query did its job by breaking ties between documents that would otherwise have the same score, using the music type and the composer to order them.
In practical scenarios, to achieve the desired relevancy boost, the boost on each of the clauses (in our case, three clauses) can be tuned by examining the debugQuery output closely. In the preceding boost query, you must have noticed (*:* -wm_type:Vocal)^2, which boosts all the documents except those of the Vocal music type. You might think of using wm_type:Vocal^0.5 instead, but it would still add to the score of the Vocal documents; hence, it wouldn't serve our purpose. We have used *:* to instruct the parser that we would like to match all the documents. In case you don't want any document to match (that is, to achieve 0 results), simply use -*:* instead.
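The additive nature of the boost query can be illustrated with a toy Python model. It is a deliberate simplification: it assumes every document already matched the user's query with the same base score and that each matching bq clause adds exactly its boost, whereas in Solr the added amount depends on the clause score and normalization. The sample documents and function names are invented for this example.

```python
def boosted_score(base_score, doc, boost_clauses):
    """Add the boost of every bq-style clause that the document matches."""
    bonus = sum(boost for matches, boost in boost_clauses if matches(doc))
    return base_score + bonus


# Mirrors: wm_type:Chamber^2 (*:* -wm_type:Vocal)^2 wm_composer:Mohan^2
boost_clauses = [
    (lambda d: d["wm_type"] == "Chamber", 2.0),
    (lambda d: d["wm_type"] != "Vocal", 2.0),
    (lambda d: d["wm_composer"] == "Mohan", 2.0),
]

# Hypothetical documents that tie on the user's query.
docs = [
    {"id": "c", "wm_type": "Vocal", "wm_composer": "Webber"},
    {"id": "b", "wm_type": "Orchestral", "wm_composer": "Webber"},
    {"id": "a", "wm_type": "Chamber", "wm_composer": "Mohan"},
]

base = 1.0  # assumed identical base score for all three documents
ranked = sorted(
    docs, key=lambda d: boosted_score(base, d, boost_clauses), reverse=True
)
print([d["id"] for d in ranked])
```

The Vocal document receives no bonus at all, which is the effect the (*:* -wm_type:Vocal)^2 clause is after; a wm_type:Vocal^0.5 clause would instead have added something to it.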
Compared to function queries, boost queries are less effective, primarily because edismax supports a multiplied boost, which is generally preferable to the addition that boost queries perform. Consider a painful situation wherein you want an equivalent boost for both the Chamber wm_type and the Mohan wm_composer values. To tackle such a situation, you need to execute the query with debugQuery enabled so as to analyze the score of each of the clauses (which is going to be different). Then, you need to use disproportionate boosts so that each boost, when multiplied by the corresponding score from debugQuery, ends up with the same value.
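The arithmetic behind such disproportionate boosts is a simple division, sketched below in Python. The clause scores 0.8 and 0.5 are hypothetical stand-ins for values you would read from the debugQuery output, the target contribution of 2.0 is arbitrary, and the function name is made up for this example.

```python
def equalizing_boosts(clause_scores, target):
    """Boost per clause so that boost * observed score equals the target."""
    return {name: target / score for name, score in clause_scores.items()}


# Hypothetical unboosted clause scores taken from a debugQuery run.
observed = {"wm_type:Chamber": 0.8, "wm_composer:Mohan": 0.5}

boosts = equalizing_boosts(observed, target=2.0)
for name, boost in boosts.items():
    print(f"{name}^{boost:.2f} contributes {boost * observed[name]:.2f}")
```

The clause with the lower observed score ends up with the larger boost, so that both contribute the same amount to the final score.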
This article described the dismax query parser, phrase boosting, and boost queries. It also gave an idea of Lucene's DisjunctionMaxQuery.
About the Author:
Surendra Mohan, who has served a few top-notch software organizations in varied roles, is currently a freelance software consultant. He has been working on various cutting-edge technologies like Drupal, Moodle, Apache Solr, ElasticSearch, Node.js, and more for the past 10 years or so. He also delivers technical talks at various community events like Drupal Meetups and Drupal Camps. To know more about him, his write-ups, technical blogs and much more, check out his site at http://www.surendramohan.info/.
He has also written the books Administrating Solr and Apache Solr High Performance published by Packt Publishing, and has reviewed other technical books, such as Drupal 7 Multi Site Configuration, Drupal Search Engine Optimization, titles on Drupal commerce, ElasticSearch, Drupal related video tutorials, title on OpsView, and many more.
Additionally, he writes technical blogs and articles for SitePoint.com. His published blogs and articles can be found at http://www.sitepoint.com/author/smohan/.